Systems on silicon show a continuous increase in complexity due to the
ever-increasing need for implementing new features and improving existing
functions. This is enabled by the increasing density with which components can be
integrated on an integrated circuit. At the same time the clock speed at which
circuits are operated tends to increase too. The higher clock speed in combination
with the increased density of components has reduced the area which can operate
synchronously within the same clock domain. This has created the need for a modular
approach. According to such an approach the processing system comprises a plurality
of relatively independent, complex modules. In conventional processing systems the
modules usually communicate with each other via a bus. As the number of
modules increases, however, this way of communication is no longer practical for the
following reasons. On the one hand the large number of modules forms too high a bus
load. On the other hand the bus forms a communication bottleneck as it enables only
one device at a time to send data to the bus. A communication network forms an
effective way to overcome these disadvantages. The communication network comprises
a plurality of partly connected nodes. Messages from a module are redirected by the
nodes to one or more other nodes. To that end the message comprises first
information indicative of the location of the addressed module(s) within the
network. The message may further include second information indicative of a
particular location within the module, such as a memory or register address. The
second information may invoke a particular response of the addressed module.

It is an object of the invention to provide an integrated circuit and a method
according to the introductory paragraph, which provides the modules therein a
relatively simple way of issuing messages.

In order to achieve said object the integrated circuit is characterized by the 
characterizing portion of claim 1. 

In the integrated circuit according to the invention modules can issue messages in a
simple way, by using a single address. This makes it possible for a module to perform
a write action to a particular memory address without being aware of the destination
in which said address is located.

In this way the network appears to the module issuing the message as a bus. This
makes it relatively simple to incorporate already existing modules designed for a
bus-like architecture in an integrated circuit according to the invention.

As such, processing systems are known, where a processor is coupled via a bus to
various memories, which each are mapped onto a respective portion of the total
address range. By way of example a ROM and a RAM may be mapped to a first and a
second address range respectively. When the processor performs a read instruction,
the address in the instruction defines at the same time which memory is selected to
read the data from.
In such known processing systems each of the various modules, such as memories, is
directly coupled to the bus. In the integrated circuit according to the invention,
selecting one of the modules implies that the one or other memories are set in a state
wherein they do not interfere with the bus traffic. Apart from the memory that is
addressed no other module is required to perform an action (in fact, they don't have to
and don't need to know that another module is active - i.e. they don't have to be 'set
in a state'), or 2) that multiple concurrent and/or pipelined messages can be active
simultaneously in the network as a whole. In an integrated circuit according to the
invention however, information issued by the active module is transferred as a
message via one or more nodes of the network. As a consequence it follows a
different route through the network depending on the address. This route is scheduled
by the network.

Examples of the two pieces of information that are arranged as a single address are:

1) Single logical memory space/map/range mapped to multiple distributed memories
each with their own physical memory ranges.

2) Virtual memory space mapped to a single logical memory space (distributed or not).

3) Multiple memory spaces/maps/ranges mapped to multiple distributed memories.

For 2) and 3) two translations may take place (vm -> logical -> physical, and
multiple -> single -> physical).
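Example 1) above can be illustrated with a minimal sketch in Python. It is not taken
from the invention itself: the region table, the module names "ROM" and "RAM", the
address ranges and the function name `decode` are all assumptions chosen for
illustration. The sketch only shows how a single logical address can carry both the
destination module and the location within that module.

```python
from bisect import bisect_right
from typing import NamedTuple


class Region(NamedTuple):
    base: int    # first logical address of the region
    size: int    # number of addresses in the region
    module: str  # hypothetical name of the destination module


# Hypothetical map: one logical address space spread over two memories.
# The list must be sorted by base address.
ADDRESS_MAP = [
    Region(base=0x0000, size=0x4000, module="ROM"),
    Region(base=0x4000, size=0x8000, module="RAM"),
]


def decode(address: int) -> tuple[str, int]:
    """Split a single logical address into (destination module, local address)."""
    bases = [region.base for region in ADDRESS_MAP]
    index = bisect_right(bases, address) - 1
    if index < 0:
        raise ValueError(f"address {address:#x} is not mapped")
    region = ADDRESS_MAP[index]
    if address >= region.base + region.size:
        raise ValueError(f"address {address:#x} is not mapped")
    return region.module, address - region.base
```

The issuing module only supplies `address`; in this sketch the network side performs
the lookup, which is what lets the network appear to the module as a bus.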

The integrated circuit of claim 3 and the method of claim 4 provide another way of 
improving data transfer in an integrated circuit comprising a plurality of modules 
connected by a network. 

Theoretically a transaction could comprise any number of outgoing and/or return 
messages. In practice however a transaction is made up of one or two outgoing 
messages (from the first to the second module), and zero, one, or two return messages 
(from the second to the first module). By managing the outgoing messages in a way 
different from the return messages the overall efficiency of the network and therewith 
the integrated circuit comprising the network is improved. This is further illustrated 
with the following embodiments. 

With reference to claim 5 it is remarked that GT connections can overbook resources
in some cases. For example, when an ANIP opens a GT read connection, it must
reserve slots for the read command messages, and for the read data messages. The
ratio between the two can be very large (e.g., > 1:100), which leads either to large
slot tables, or to bandwidth being wasted for the read command messages. In order to
prevent as much as possible that a reservation for guaranteed traffic would impede
other transactions the bandwidth which can be reserved should be restricted. On the
other hand the best effort traffic may use any resources which are currently available.
As a consequence guaranteed traffic has a bounded but on average higher latency than
best-effort traffic, which has no fixed upper bound but is (or should be) faster on
average.

Based on this recognition it has been found that the overall quality of the network
transport could be improved by exploiting BE packets for read command
messages, and GT packets for read data messages. No guarantees can be offered in
this case, but the overall throughput can be higher and more stable than in the case of
using only BE packets.

With reference to claim 6 it is remarked that preferably the outgoing transactions are
handled in a locally ordered and the return transactions in a globally ordered
transaction mode. The one or more addressed modules process the transactions in the
order they have been issued, and the return parts of the transactions are all delivered
to the first module in the order in which it initiated the transactions. Even if ordered
channels are used, the responses from different addressed modules (e.g., in a narrow
cast connection) must be sorted at the first module. This kind of ordering conforms
with AMBA.



To implement global ordering, transactions that are delivered to different second
modules (also referred to as slaves) must be ordered exactly as they were sent by the
first module (also referred to as master). One option is for the network to
have a global time indicator, and to use e.g. deadline-based scheduling in the network,
while in addition assumptions on the consumption time of the second modules must be
available. An alternative way to introduce global ordering is to introduce explicit
dependencies between transactions. The latter can be done by using
acknowledged/tagged transactions, where proof of delivery to the slave is sent back to
the master using an acknowledgement message. This solution, however, introduces
extra latency because transactions are sequentialised with a round-trip delay/latency
per transaction (send a message, wait for the acknowledgement, send next message,
wait for next acknowledgement, etc.). By requiring only a local ordering for the
delivery of the outgoing transactions, the slaves, provided that they are autonomous
(which is usually the case), can execute messages independently.
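The sorting of responses at the first module, which a globally ordered return part
requires, can be sketched as a small reorder buffer. This is an illustrative sketch
only: tagging each transaction with a sequence number is an assumption made here, and
the class and method names are hypothetical.

```python
class ReorderBuffer:
    """Releases responses in the order the master initiated the transactions."""

    def __init__(self) -> None:
        self._next_issue = 0    # sequence number for the next new transaction
        self._next_deliver = 0  # sequence number the master expects next
        self._pending = {}      # seq -> response that arrived ahead of its turn

    def issue(self) -> int:
        """Tag a newly initiated transaction with the next sequence number."""
        seq = self._next_issue
        self._next_issue += 1
        return seq

    def receive(self, seq: int, response: str) -> list[str]:
        """Store a response and return every response that is now deliverable."""
        self._pending[seq] = response
        ready = []
        while self._next_deliver in self._pending:
            ready.append(self._pending.pop(self._next_deliver))
            self._next_deliver += 1
        return ready
```

A response from a fast slave is held back until all earlier transactions, possibly
addressed to slower slaves, have returned. With local ordering over ordered channels
no such buffer would be needed.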

With reference to claim 7 it is remarked that in this way buffer space is used in
an efficient way. A particular example is an embodiment wherein a large buffer
space is reserved for the buffer of the network interface coupled to an active
module (such as a module issuing a read command) and a small buffer space is
reserved for the buffer of the network interface coupled to a passive module, e.g.
the one receiving the read message.

In other situations there may be different types of flow control (e.g. you never want
to lose write commands, but don't mind losing read data). If a module can do both
read and write commands, it may be important that write transactions always succeed
(e.g. when writing to an interrupt controller), but that read transactions are not
critical because they can be retried (so the CMD of the read transaction is dropped
and the read never executed, or the RETDATA is dropped after the read has been
executed). Another example is that if you know that writes always succeed if they are
delivered, a flow-controlled connection is requested. Acknowledgements are not
necessary in that case. Without flow control acknowledgements are compulsory,
complicating the master and causing additional traffic.

In the integrated circuit according to the invention the decision to drop messages or
not is not made per transaction but for the outgoing and return parts of a connection
as a whole. For example all outgoing messages (having the format read+address or
write+address+data) may be guaranteed lossless, while for all return messages
(whether read data or write acknowledgements) packets may be dropped.

A connection could be opened as follows:

connid = open (
    nofc/fc,
    outgoing unordered/local/global,
    outgoing buffer size,
    return unordered/local/global,
    return buffer size);

i.e. all outgoing messages have certain properties, and all return messages have
certain properties.
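The parameter list above can be made concrete with a small sketch. It is one possible
rendering only: the names `Ordering`, `Connection` and `open_connection`, the use of
a single boolean for nofc/fc, and the check on buffer sizes are assumptions, not an
interface defined by the invention.

```python
from dataclasses import dataclass
from enum import Enum


class Ordering(Enum):
    UNORDERED = "unordered"
    LOCAL = "local"
    GLOBAL = "global"


@dataclass(frozen=True)
class Connection:
    flow_control: bool          # fc (True) or nofc (False)
    outgoing_ordering: Ordering
    outgoing_buffer_size: int
    return_ordering: Ordering
    return_buffer_size: int


def open_connection(
    flow_control: bool,
    outgoing_ordering: Ordering,
    outgoing_buffer_size: int,
    return_ordering: Ordering,
    return_buffer_size: int,
) -> Connection:
    """Describe a connection whose outgoing and return parts differ in properties."""
    if outgoing_buffer_size <= 0 or return_buffer_size <= 0:
        raise ValueError("buffer sizes must be positive")
    return Connection(
        flow_control,
        outgoing_ordering,
        outgoing_buffer_size,
        return_ordering,
        return_buffer_size,
    )


# Hypothetical read connection: small command buffer, large read-data buffer.
read_conn = open_connection(True, Ordering.LOCAL, 4, Ordering.GLOBAL, 64)
```

The example values follow the text: outgoing transactions locally ordered, return
transactions globally ordered, and more buffering on the side that carries data.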

With reference to claim 8 it is remarked that in a processing system with modules 
working asynchronously with respect to each other it is usual that a module receiving 
data issues an acknowledge signal to inform the issuing processor that it has received 
a message. In case that a message is multicast a plurality of said acknowledge signals 
is generated, which imposes a burden for the issuing processor. In the integrated
circuit of the invention the first module receives only a single message, which reduces
this burden. This measure is based on the insight that the network usually can
relatively easily generate the single return message in response to the plurality of 
acknowledge messages of the second modules as a side effect of the functions already 
present in the network for other purposes. 

With reference to claim 9: Depending on the situation the single return message can
depend on the acknowledge messages in various ways. The embodiment of claim 2 is
favorable where the addressed second modules are memories, and the first module
attempts to store data therein. In that case it is sufficient that only one copy of the data
is really received and stored.

With reference to claim 10: In other situations it is compulsory that each of the
addressed second modules has received the data. In the embodiment of claim 10 the
single return message is not generated until this is the case.

Otherwise the return message could be combined as follows.

If the write transaction has been successfully executed by all slaves, all will
return RETSTAT=RETOK, which can be combined by the ANIP in a single message to be
delivered to the master.

If the write transaction has been successfully executed only by some slaves, there
will be a mix of RETSTATs (RETOK and RETERROR). They can either be
combined into

(a) a single RETSTAT=RETERROR, to specify that an error occurred, or

(b) a single RETSTAT, but a larger one, more descriptive, encoding
where there have been errors. All RETSTATs can be bundled together
in a single RETSTAT for the master, or <slave identifier, error code>
pairs can be bundled to form a single RETSTAT for the master.

If the connection has no flow control, messages can be dropped at the PNIPs,
resulting also in RETSTAT=RETLOST messages. Again, combinations as those above
can be made.
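The two ways of combining the return statuses of a multicast write can be sketched
as follows. This is an illustration under assumptions: the statuses are modelled as
strings, slaves are identified by hypothetical names, and treating RETLOST like an
error in option (a) is a choice made here rather than something the text prescribes.

```python
RETOK = "RETOK"
RETERROR = "RETERROR"
RETLOST = "RETLOST"


def combine_single(statuses: dict[str, str]) -> str:
    """Option (a): collapse the statuses of all slaves into one status value."""
    if not statuses:
        raise ValueError("no return statuses to combine")
    if all(status == RETOK for status in statuses.values()):
        return RETOK
    # Assumption: any failure, including a lost message, is reported as an error.
    return RETERROR


def combine_descriptive(statuses: dict[str, str]) -> list[tuple[str, str]]:
    """Option (b): bundle <slave identifier, error code> pairs for failing slaves."""
    return sorted(
        (slave, status) for slave, status in statuses.items() if status != RETOK
    )
```

In both cases the master receives a single message; option (b) trades a larger
message for knowing where the errors occurred.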




PHNLG21031BPP 




08.10.2002 

With reference to claim 11: In this way it is guaranteed that the first module always
receives a response to a transaction, even if the connection has no flow control (i.e.
data may be dropped). This is done by only dropping data in the PNIP (the network
interface coupled to the second, receiving module), and returning a FAIL/ERROR to
the ANIP (the network interface coupled to the first module). This return status
(RETSTAT) message will never be dropped because the ANIP that initiated the
transaction will reserve space for return messages of every transaction that it initiates.
This combination of reserving space and generating an error message whenever a
message is dropped is a way to introduce flow control. Preferably the RETSTAT
message is generated by the interface of the receiving module, although alternatively
it could be generated at the intermediary network nodes too.
The method according to the invention guarantees transaction completion, i.e. it is
always known whether an initiated transaction

(a) was delivered and executed successfully at the slave (RETSTAT=RETOK produced by
the slave), or

(b) was never delivered at the slave (RETSTAT=REQLOST produced by the PNIP), 
or 

(c) was delivered at the slave, but not successfully executed (RETSTAT=ERROR 
produced by the slave), or 

(d) was delivered and executed successfully at the slave but the response message was 
dropped (RETSTAT=RETLOST produced by the ANIP). 

This is achieved by either

(i) not dropping messages (flow-controlled connection), in this case RETSTAT is
either OK or ERROR, or

(ii) by allowing messages to be dropped (on a connection without flow control), but
generating a RETSTAT (REQLOST or RETLOST) whenever the message is dropped,
or a RETOK or RETERROR as usual when the message is not dropped.

It is essential however, never to drop RETSTATs, because this completes the
transaction. This is realized in that a buffer for the RETSTAT is located at the
master's ANIP. The latter reserves space for RETSTATs when initiating transactions,
and bounds the number of outstanding transactions (for finite sized RETSTAT buffers).
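The reservation rule can be sketched as a counter at the ANIP. This is a simplified
model under assumptions: one buffer slot per RETSTAT, and hypothetical method names
`initiate` and `complete`. It only shows why reserving space at initiation bounds the
number of outstanding transactions.

```python
class Anip:
    """Model of an ANIP that reserves RETSTAT space before starting a transaction."""

    def __init__(self, retstat_slots: int) -> None:
        if retstat_slots <= 0:
            raise ValueError("at least one RETSTAT slot is required")
        self._capacity = retstat_slots
        self._outstanding = 0

    def initiate(self) -> bool:
        """Reserve a slot; refuse the transaction if the buffer is fully reserved."""
        if self._outstanding == self._capacity:
            return False
        self._outstanding += 1
        return True

    def complete(self) -> None:
        """The RETSTAT arrived and was consumed by the master; free its slot."""
        if self._outstanding == 0:
            raise RuntimeError("no outstanding transaction to complete")
        self._outstanding -= 1

    @property
    def outstanding(self) -> int:
        return self._outstanding
```

Because a slot exists for every outstanding transaction, an arriving RETSTAT always
finds room in this model and never has to be dropped.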

The flow control on the outgoing and return connections is in principle independent.
Thus, for outgoing flow control & return flow control, the RETSTAT message is
according to a) or c) above.

In case of outgoing flow control & no return flow control, the RETSTAT message is
a) or c) or d) above.

In case of no outgoing flow control & return flow control, the RETSTAT message is
a) or b) or c) above.
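The three cases follow one rule: REQLOST is possible only without outgoing flow
control, and RETLOST only without return flow control. A minimal sketch of that rule
is given below; the function name is hypothetical and the status names follow the
RETOK/RETERROR/REQLOST/RETLOST spelling used elsewhere in the text. Applying the same
rule to a connection with no flow control in either direction yields all four
outcomes, which is an inference and is not stated in the text.

```python
def possible_retstats(outgoing_fc: bool, return_fc: bool) -> set[str]:
    """Return the RETSTAT values a transaction can end with on a connection."""
    outcomes = {"RETOK", "RETERROR"}  # cases (a) and (c) are always possible
    if not outgoing_fc:
        outcomes.add("REQLOST")       # case (b): request dropped at the PNIP
    if not return_fc:
        outcomes.add("RETLOST")       # case (d): response dropped at the ANIP
    return outcomes
```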

Other embodiments are such an integrated circuit wherein the return message is a
message indicating whether the second module has received a message from the first
module. In this embodiment the return message can be very compact, e.g. one or two
bits to indicate one of the four options described above.
Alternatively or in addition a return message comprises an identification of the
message received by the second module.

1. I suggest "efficiency" instead of "performance", because performance is just one of
the factors. We may have the option to reduce the cost of the network (e.g., reduce
buffer sizes), or increase the performance (e.g., by adding more connections for the
same resources).

2. This is an example for the use of different properties for outgoing and return
parts. However, more can be defined:

- Acknowledged write transaction: write command + outgoing data use guaranteed
throughput (mode one in your example), and acknowledgment uses best effort (mode two
in your example).

Moreover, apart from time-related guarantees, there is also a distinction on the
buffering in both your and the above example. For data messages there is potentially
more buffering allocated than for commands and acknowledgments. Consequently, for a
read transaction (your example) buffers for the return part would be larger than those
for the outgoing part. For the acknowledged write (the example above), buffers for the
outgoing part are larger, and those for acknowledgments are smaller.

3. It is indeed possible to allocate different bandwidths as you suggest. However,
there are also limitations. We use a slot table, which contains a number of slots in a
time window. Bandwidth is reserved by allocating these slots to connections. For
example, if we use a table with 100 slots for a time frame of 1 µs, each slot will be
allocated for 1/100 of 1 µs = 10 ns. If the network provides 1 Gb/s per link, the
bandwidth per slot will be 1/100 of 1 Gb/s = 10 Mb/s. We can only allocate multiples
of 10 Mb/s for guaranteed throughput traffic.

For a read command generating long bursts, allocating the minimum bandwidth of
10 Mb/s would probably be too much, as it will use only a small fraction of it. The
bandwidth can indeed be used by best-effort traffic, however, not by other guaranteed
throughput traffic. As a result, not all the traffic for which guarantees are needed
may fit in the slot table.

An alternative is to use more slots, but this increases the cost of the router. This
is why a best effort command may be a better solution.
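The slot-table arithmetic in this comment can be checked with a short sketch. The
function names are hypothetical; the figures (100 slots, a 1 µs frame, a 1 Gb/s link)
are the ones used in the example above.

```python
import math


def slot_parameters(slots: int, frame_ns: int, link_mbps: int) -> tuple[float, float]:
    """Return (duration of one slot in ns, bandwidth of one slot in Mb/s)."""
    return frame_ns / slots, link_mbps / slots


def slots_needed(required_mbps: float, slots: int, link_mbps: int) -> int:
    """Guaranteed bandwidth is granted in whole slots, so round the demand up."""
    return math.ceil(required_mbps * slots / link_mbps)


slot_ns, slot_mbps = slot_parameters(slots=100, frame_ns=1000, link_mbps=1000)
```

With these figures a read command stream that needs only 1 Mb/s still occupies a full
10 Mb/s slot, which is the waste the comment describes.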

4. This definition is good for outgoing messages, as there is one source (ANIP) and
potentially multiple destinations (PNIPs). However, for return messages, we define
global/local ordering as follows. Global ordering means that responses from all
PNIPs/slaves (i.e., sources of messages in this case) come in the same order as the
transactions have been initiated (i.e., the same order as the commands have been
issued by the master to the ANIP). Local ordering guarantees the order of responses
only if they come from the same slave/PNIP.

5. Slave module?

6. We can only guarantee the order we offer transactions to the slave module, but the
order of processing depends on the module implementation. It can well decide to
process transactions in a different order (e.g., a memory controller). For ordering we
only require that the responses are returned in the same order as the transactions
were accepted.

7. This is only valid for global ordering. For local ordering (i.e., order preserved
only per slave), if ordered transport channels are used, no sorting is necessary.

8. Global ordering of responses conforms with AMBA. Local ordering of responses does
not.

9. I think Kees meant write transactions may be critical and we don't want to lose
them, but read transactions can be lost, because they can be tried later. See example
below in the text.

10. The two commands (i.e., read and write) can indeed be sent from the same module.
If we set up a connection with flow control for the outgoing part both commands will
be delivered. However, if the return part has no flow control, the responses for read
commands may be lost. In such a case, the read transactions will fail. I think Kees
meant read transactions being lost, not read commands being lost.

11. fc = flow control, nofc = no flow control

12. Buffer is reserved only for a return status message, such as an acknowledgment, or
an error message. Buffer can be, but is not necessarily, reserved also for returned
data.

13. Data can also be dropped at the ANIP (i.e., RETDATA) when no flow control is
implemented for the return part. In such a case, a RETSTAT=RETLOST will replace the
RETSTAT=RETOK which accompanied the dropped RETDATA.

14. Has reserved

15. Yes, this is true. Between routers, there is always link level flow control and no
data is ever lost. Data can be lost only in the network interfaces, if no end-to-end
flow control (here referred to simply as flow control) is implemented. Therefore,
here, messages reach the PNIP even when no (end-to-end) flow control is implemented.



These and other aspects are described in more detail in the following three annexes:

1. Communication Services for Networks on Chip, pages 1-25, by Andrei
Radulescu and Kees Goossens;

Further background information useful for implementing the invention can be found in

2. Networks on Silicon: Blessing or Nightmare?, pp. 1-5, by Paul Wielage and
Kees Goossens (published), and

3. Trade-Offs in the Design of a Router with Combined Guaranteed and Best-Effort
Services for Networks on Chip, pp. 1-6, by Edwin Rijpkema, Kees Goossens,
Andrei Radulescu, Jef van Meerbergen, and Paul Wielage, submitted to and rejected
by ISSS 2002.

I. Introduction



Networks on chip (NoC) have received considerable attention recently
as a solution to the interconnect problem in highly-complex chips [3-5, 7-9,
15, 19, 22]. The reason is twofold. First, NoCs help resolve the electrical
problems in new deep-submicron technologies, as they structure and
manage global wires [3-5, 7, 8]. At the same time they share wires, lowering
their number and increasing their utilization [7, 8]. NoCs can also be
energy efficient and reliable [4], and are scalable compared to buses [9].
Second, NoCs also decouple computation from communication, which is
essential in managing the design of billion-transistor chips [14, 22]. NoCs
achieve this decoupling because they are traditionally designed using protocol
stacks [21], which provide well-defined interfaces separating communication
service usage from service implementation [5, 22].

Using networks for on-chip communication when designing systems on
chip (SoC), however, raises a number of new issues that must be taken
into account. This is because, in contrast to existing on-chip interconnects
(e.g., buses, switches, or point-to-point wires), where the communicating
modules are directly connected, in a NoC the modules communicate remotely
via network nodes. As a result, interconnect arbitration changes
from centralized to distributed, and issues like out-of-order transactions,
higher latencies, and end-to-end flow control must be handled either by
the intellectual property block (IP) or by the network itself.

Most of these topics have already been the subject of research in the field
of computer networks [24] and parallel machine interconnect networks [6].
However, on-chip networks have different properties (e.g., tighter link
synchronization) and constraints (e.g., higher memory cost) leading to different
design choices, which in the end affect the network services.

In this paper, we compare NoCs and off-chip networks showing both
their similarities and differences. We also explore the differences between
NoCs and existing on-chip interconnects. We list new issues that must be
resolved in system design due to the multi-hop nature of NoCs, and present
an interface which takes these issues into consideration. Our interface is
aimed at being similar to a split-transaction bus interface, such as VCI [25]
or OCP [17], to allow simple, low-cost wrappers to bus interfaces, and
to allow backward compatibility with existing IPs. Our interface uses a
request-response protocol that provides basic read and write operations.
But our interface extends bus interfaces to fully exploit the power of our
NoC [8, 19, 20]. For example, it offers connection-based communication
where end-to-end flow control and time-related guarantees (e.g., bounded
latency) can be requested.

The paper is organized as follows. In the next two sections we compare
NoC properties with those of off-chip networks and buses, respectively.
In Section IV, we present the services we offer in our network. Finally, we
present our conclusions.

II. Networks Brought on Chip

Networks have been the subject of research for decades, both in the
context of local and wide area networks (computer networks) [24], and as
an interconnect for parallel machines [6]. Both are very much related to
on-chip networks, and many of the results in those fields are also applicable
on chip. However, NoC premises are different from off-chip networks,
and, therefore, most of the network design choices must be reevaluated.

NoCs differ from off-chip networks mainly in their constraints and
synchronization. Typically, most on-chip resources have much tighter
constraints compared to off-chip. Storage (i.e., memory) and computation
resources are relatively more expensive, whereas the number of point-to-point
links is larger on chip than off chip [7].

Storage is expensive, because general-purpose on-chip memory, such as
RAMs, occupies a large area. Having the memory distributed in the network
components in relatively small sizes is even worse, as the overhead area
in the memory then becomes dominant.

Also computation for on-chip networks comes at a relatively high cost
compared to off-chip networks. An off-chip network interface usually contains
a dedicated processor to implement the protocol stack up to network
layer or even higher, to off-load the host processor from the communication
processing. Including a dedicated processor in a network interface is
not feasible on chip, as the size of the network interface will become comparable
to or larger than the IP to be connected to the network. Moreover,
running the protocol stack on the IP itself may also be not feasible, because
often these IPs have one dedicated function only, and do not have
the capabilities to run a network protocol stack.

The number of wires and pins to connect network components is an
order of magnitude larger on chip than off chip [7]. If they are not used
massively for other purposes than NoC, they allow wide point-to-point
interconnects (e.g., 300-bit links) [7, 15]. This is not possible off-chip, where
links are relatively narrower: 8-16 bits.

On-chip wires are also relatively short, allowing a much tighter
synchronization than off chip. This allows a reduction in the buffer space in
the routers because the communication can be done at a smaller granularity.
In the current semiconductor technologies, wires are also fast and
reliable, which allows simpler link-layer protocols (e.g., no need for error
correction, or retransmission). This also compensates for the lack of
memory and computational resources.

In the rest of the section, we list five network issues that have a direct
impact on the NoC cost: reliable communication, deadlock, data ordering,
network flow control and buffering strategy, and time-related guarantees.
For each of them, we discuss the differences and similarities for on- and
off-chip networks.

Reliable communication. A consequence of the tight on-chip resource
constraints is that the network components (i.e., routers and network
interfaces) must be fairly simple to minimize computation and memory
requirements. Luckily, on-chip wires provide a reliable communication
medium, which avoids the considerable overhead incurred by the off-chip
networks for providing reliable communication. Data integrity can be provided
at low cost at the data link layer. However, data loss also depends
on the network architecture, as in most computer networks data is simply
dropped if congestion occurs in the network [6, 24]. On-chip, dropping
data may lead to a too costly implementation of reliable communication.
We show below that a network where no data is dropped can lead to a much
lower-cost solution, at the peril of introducing the possibility of deadlock.



Deadlock. Computer network topologies have generally an irregular
(possibly dynamic) structure and bidirectional links, which can introduce
buffer cycles. In such topologies, packet dropping at the network nodes
may be required to avoid deadlocks.

Deadlock can also be avoided without dropping data, for example, by
introducing constraints either in the topology or routing. Fat-tree topologies
have already been considered for NoCs, where deadlock is avoided by
bouncing back packets in the network in case of overflow [9]. Tile-based
approaches to system design [7, 15, 23] use mesh or torus network topologies,
where deadlock can be avoided using, for example, a turn-model routing
algorithm [6].

An alternative solution for deadlock in NoCs, which takes into consideration
that modules connecting to the network are either masters (initiating
requests and receiving responses), or slaves (receiving requests and sending
back responses), is to maintain separate virtual networks (with separate

[6].

Data ordering. In a network, data sent from a source to a destination
may arrive out of order due to reordering in network nodes, following
different routes, or retransmission after dropping. For off-chip networks
out-of-order data delivery is typical. However, for NoCs where no data is
dropped, data can be forced to follow the same path between a source and
a destination (deterministic routing) with no reordering. This in-order data
transportation requires less buffer space, and reordering modules are no
longer necessary.



Network flow control and buffering strategy. Network flow control
and buffering strategy have a direct impact on the memory utilization
in the network. Wormhole routing requires only a flit buffer in the
router, whereas store-and-forward and virtual-cut-through routing require
at least the buffer space to accommodate a packet. Consequently, on chip,
wormhole routing may be preferred over virtual-cut-through or store-and-forward
routing. Similarly, input queuing may be a lower memory-cost alternative
to virtual-output-queuing or output-queuing buffering strategies,
because it has fewer queues. Dedicated FIFO memory structures at a low cost
also enable on-chip usage of virtual-cut-through routing or virtual output
queuing for a better performance [19]. However, using virtual-cut-through
routing and virtual output queuing at the same time is still too costly [19].

Figure 1. A network interconnect example

Figure 2. A bus interconnect example



Time-related guarantees. Off-chip networks typically use packet
switching and offer best-effort services. Contention can occur at each network
node, making latency guarantees very hard to offer. Throughput guarantees
can still be offered using schemes such as rate-based switching [26]
or deadline-based packet switching [18], but with high buffering costs.

An alternative to provide such time-related guarantees is to use time-division
multiple access (TDMA) circuits, where every circuit is dedicated
to a network connection. Circuits provide guarantees at a relatively low
memory and computation cost. Network resource utilization is increased
when the network architecture allows any left-over guaranteed bandwidth
to be used by best-effort communication [10, 19, 20].



III. From Buses to NoCs

Introducing networks (Figure 1) as on-chip interconnects radically
changes the communication when compared to direct interconnects, such
as buses or switches (Figure 2). This is because of the multi-hop nature
of a network, where communication modules are not directly connected,
but separated by one or more network nodes. This is in contrast with the
prevalent existing interconnects (i.e., buses) where modules are directly
connected. The implications of this change reside in the arbitration (which
must change from centralized to distributed), and in the communication
properties (e.g., ordering, or flow control).

In this section, we list some of these topics, and outline the differences
of NoCs and buses. We refer mainly to buses as direct interconnects,
because currently they are the most used on-chip interconnect. Most
of the bus characteristics also hold for other direct interconnects (e.g.,
switches [16]). Multilevel buses are a hybrid between buses and NoCs.
Depending on the functionality of the bridges, for our purposes, multilevel
buses either behave like simple buses [2] or like NoCs.

Programming Model. The programming model of a bus typically
consists of load and store operations which are implemented as a sequence
of primitive bus transactions. Bus interfaces typically have dedicated
groups of wires for command, address, write data, and read data [1,
12, 13, 17, 25].

A bus is a resource shared by multiple IPs. Therefore, before using it,
IPs must go through an arbitration phase, where they request access to the
bus, and block until the bus is granted to them.

A bus transaction involves a request and possibly a response. Modules
issuing requests are called masters, and those serving requests are called
slaves. If there is a single arbitration for a request-response pair, the
bus is called non-split. In this case, the bus remains allocated to the
master of the transaction until the response is delivered, even when this
takes a long time. Alternatively, in a split bus, the bus is released
after the request to allow transactions from different masters to be
initiated. However, a new arbitration must be performed for the response
such that the slave can access the bus [11].

For both split and non-split buses, both communication parties have
direct and immediate access to the status of the transaction. In contrast,
network transactions are one-way transfers from an output buffer at the
source to an input buffer at the destination that cause some action at the
destination, the occurrence of which is not visible at the source [6]. The
effects of a network transaction are observable only through additional
transactions. A request-response type of operation is still possible, but
requires at least two distinct network transactions. Thus, a bus-like
transaction in a NoC will essentially be a split transaction.

Transaction Ordering. Traditionally, on a bus all transactions are
ordered (cf. Peripheral VCI [25], AMBA [1], or CoreConnect PLB and
OPB [12, 13]). This is possible at a low cost, because the interconnect,
being a direct link between the communicating parties, does not reorder
data. However, on a split bus, a total ordering of transactions on a
single master may still cause performance penalties, when slaves respond
at different speeds. To solve this problem, recent extensions to bus
protocols allow transactions to be performed on connections. Ordering of
transactions within a connection is still preserved, but between
connections there are no ordering constraints (e.g., OCP [17], or Basic
VCI [25]). A few of the bus protocols allow out-of-order responses per
connection in their advanced modes (e.g., Advanced VCI [25]), but both
requests and responses arrive at the destination in the same order as they
were sent.

In a NoC, ordering becomes weaker. Global ordering can only be
provided at a very high cost due to the conflict between the distributed
nature of the networks, and the requirement of a centralized arbitration
necessary for global ordering.

Even local ordering, between a source-destination pair, may be costly.
Data may arrive out of order if it is transported over multiple routes. In
such cases, to still achieve an in-order delivery, data must be labeled
with sequence numbers and reordered at the destination before being
delivered.
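
The reordering step described above can be sketched as follows. This is a minimal illustration, not the implementation used in any particular NoC; the class name ReorderBuffer and the convention that sequence numbers start at 0 are assumptions made for the example.

```python
class ReorderBuffer:
    """Delivers payloads in sequence-number order at the destination."""

    def __init__(self):
        self.next_seq = 0   # assumed: sequence numbers start at 0
        self.pending = {}   # out-of-order payloads, keyed by sequence number

    def receive(self, seq, payload):
        """Accept one packet and return the payloads that are now deliverable."""
        self.pending[seq] = payload
        delivered = []
        # Release every consecutive packet that has already arrived.
        while self.next_seq in self.pending:
            delivered.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return delivered


# Packets travel over different routes and arrive out of order.
buf = ReorderBuffer()
out = []
for seq, payload in [(1, "b"), (0, "a"), (3, "d"), (2, "c")]:
    out.extend(buf.receive(seq, payload))
```

The buffer space held in `pending` is the cost of local ordering that the text refers to: it grows with the amount of data that can overtake an earlier packet.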

Atomic Chains of Transactions. An atomic chain of transactions is
a sequence of transactions initiated by a single master that is executed on
a single slave exclusively. That is, other masters are denied access to
that slave, once the first transaction in the chain has claimed it. This
mechanism is widely used to implement synchronization mechanisms between
master modules (e.g., semaphores).

On a bus, atomic operations can easily be implemented, as the central
arbiter will either (a) lock the bus for exclusive use by the master
requesting the atomic chain, or (b) know not to grant access to a locked
slave. In the former case, the time resources are locked is shorter
because once a master has been granted access to a bus, it can quickly
perform all the transactions in the chain (no arbitration delay is
required for the subsequent transactions in the chain). Consequently, the
locked slave and the bus can be opened up again in a short time. This
approach is used in AMBA and CoreConnect. In the latter case, the bus is
not locked, and can still be used by other modules, however, at the price
of a longer locking time of the slave. This approach is used in VCI and
OCP.

In a NoC, where the arbitration is distributed, masters do not know that
a slave is locked. Therefore, transactions to a locked slave may still be
initiated, even though the locked slave cannot accept them. Consequently,
to prevent deadlock, these other transactions must be either dropped, or
stored such that transactions in the atomic chain can be filtered and still
be served. Moreover, the time a module is locked is much longer in the case
of NoCs, because of the higher latency per transaction.

Deadlock. In buses, deadlocks are generally not an issue. Deadlock
can still occur at the application level (e.g., an atomic chain of
transactions that locks the bus, which is never finished), but it is not
caused by the interconnect itself.

In a network, deadlock becomes a more important issue, and special
care has to be taken in the network design to avoid deadlock. Deadlock is
mainly caused by cycles in the buffers. To avoid deadlock, either network
nodes must drop packets when their buffers are filled, or routing must be
cycle-free. In a NoC, we believe the latter is preferable, because of its
lower cost in achieving reliable communication (see Section II).

A second cause of deadlock is atomic chains of transactions. The reason
is that while a module is locked, the queues storing transactions may
get filled with transactions outside the atomic transaction chain, blocking
the transactions in the chain from reaching the locked module. If
atomic transaction chains must be implemented (to be compatible with
processors allowing this, such as MIPS), the network nodes should be able
to filter the transactions in the atomic chain, or be allowed to drop those
blocking them.

Media Arbitration. An important difference between buses and
NoCs is in the media arbitration scheme. In a bus, master modules
request access to the interconnect, and the arbiter grants the access for
the whole interconnect at once. Arbitration is centralized as there is only
one arbiter component, and global as all the requests as well as the state
of the interconnect are visible to the arbiter. Moreover, when a grant is
given, the complete path from the source to the destination is exclusively
reserved.

In a non-split bus, arbitration takes place once when a transaction is
initiated. As a result, the bus is granted for both request and response.
In a split bus, requests and responses are arbitrated separately.

In a NoC, arbitration is also necessary, as it is a shared interconnect.
However, in contrast to buses, the arbitration is distributed, because it
is performed in every router, and is based only on local information.
Arbitration of the communication resources (links, buffers) is performed
incrementally as the request or response advances [19].

Destination Name and Routing. For a bus, the command, address,
and data are broadcast on the interconnect. They arrive at every
destination, of which one activates based on the broadcast address, and
executes the requested command. This is possible because all modules are
directly connected to the same bus.

In a NoC, it is not feasible to broadcast information to all destinations,
because it must be copied to all routers and network interfaces. This
floods the network with data. The address is better decoded at the source
to find a route to the destination module. A transaction address will
therefore have two parts: (a) a destination identifier, and (b) an internal
address at the destination.
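
A minimal sketch of such source-side address decoding is given below. The field widths (8 bits for the destination identifier, 24 bits for the internal address), the ROUTES table and its port names are all hypothetical; the text only states that the address has these two parts and that the source uses it to find a route.

```python
DEST_BITS = 8      # assumed width of the destination identifier
OFFSET_BITS = 24   # assumed width of the internal address


def split_address(addr):
    """Split a transaction address into (destination id, internal address)."""
    dest_id = (addr >> OFFSET_BITS) & ((1 << DEST_BITS) - 1)
    internal = addr & ((1 << OFFSET_BITS) - 1)
    return dest_id, internal


# Hypothetical route table held at the source: destination id -> output
# port to take at each router along the path.
ROUTES = {0x02: ["east", "east", "north"]}

dest_id, internal = split_address(0x02001F40)
route = ROUTES[dest_id]   # only the source decodes the destination part
```

Only the internal address needs to travel to the destination module; the destination identifier is consumed when the route is selected.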

Latency. Transaction latency is caused by two factors: (a) the access
time to the bus, which is the time until the bus is granted, and (b) the
latency introduced by the interconnect to transfer the data.

For a bus, where the arbitration is centralized, the access time is
proportional to the number of masters connected to the bus. The transfer
latency itself typically is constant and relatively fast, because the
modules are linked directly. However, the speed of transfer is limited by
the bus speed, which is relatively slow for buses.

In a NoC, arbitration is performed at each router for the following link.
The access time per router is small. Both end-to-end access time and
transport time increase proportionally to the number of hops between master
and slave. However, network links are unidirectional and point to point,
and hence can run at higher frequencies than buses, thus lowering the
latency.

From a latency perspective, using a bus or a network is a trade-off
between the number of modules connected to the interconnect (which affects
access time), the speed of the interconnect, and the network diameter.

Data format. In most modern bus interfaces the data format is
defined by separate wire groups for the transaction type, address, write
data, read data, and return acknowledgments/errors (e.g., VCI, OCP, AMBA,
or CoreConnect). This is used to pipeline transactions. For example,
concurrently, a new transaction can be issued, the data of an earlier
write transaction can be sent, and the data from an even earlier read
transaction can be received. Moreover, having dedicated wire groups
simplifies the transaction decoding: there is no need for a mechanism to
select between different kinds of data sent over a common set of wires.

Inside a network, there is typically no distinction between different
kinds of data. Data is treated uniformly, and passed from one router to
another. This is done to minimize the control overhead and buffering in
routers. If separate wires were used for each of the above-mentioned
groups, separate routing, scheduling, and queuing would be needed,
increasing the cost of the routers.


In addition, in a network at each layer in the protocol stack, control
information must be supplied together with the data (e.g., command type,
address, or data size). This control information is organized as an
envelope around the data. That is, first a header is sent, followed by the
actual data (payload), followed possibly by a trailer. Multiple such
envelopes may be provided for the same data, each carrying the
corresponding control information for each layer in the network protocol
stack [6, 24].

Buffering and Flow Control. Buffering data of a master (output
buffering) is used both for buses and NoCs to decouple computation from
communication. However, for NoCs output buffering is also needed to
marshal data, which consists of (a) (optionally) splitting the outgoing
data in smaller packets which are transported by the network, and (b)
adding control information for the network around the data (packet
header). To avoid output buffer overflow the master must not initiate
transactions that generate more data than the currently available space.
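
The marshalling steps (a) and (b), and the overflow rule for the output buffer, can be sketched as below. The header fields (dest, seq, size), the maximum payload of 4 words, and the one-word header cost are assumptions chosen for the example, not values taken from the text.

```python
def marshal(dest_id, data, max_payload=4):
    """Split outgoing data into packets and wrap each one in a header."""
    packets = []
    for seq, start in enumerate(range(0, len(data), max_payload)):
        payload = data[start:start + max_payload]
        # Hypothetical header: the envelope carrying network control info.
        header = {"dest": dest_id, "seq": seq, "size": len(payload)}
        packets.append((header, payload))
    return packets


def can_initiate(free_space, data, max_payload=4, header_words=1):
    """True if the marshalled transaction fits in the free output buffer space."""
    n_packets = -(-len(data) // max_payload)   # ceiling division
    return len(data) + n_packets * header_words <= free_space


words = list(range(10))
packets = marshal(3, words)
reassembled = [w for _, payload in packets for w in payload]
```

Note that the space check has to count the headers as well as the payload, since marshalling makes the buffered data larger than the data handed over by the master.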

Similarly to output buffering, input buffering is also used to decouple
computation from communication. In a NoC, input buffering is also
required to unmarshal data.

In addition, flow control for input buffers differs for buses and NoCs.
For buses, the source and destination are directly linked, and the
destination can therefore signal directly to a source that it cannot
accept data. This information can even be available to the arbiter, such
that the bus is not granted to a transaction trying to write to a full
buffer.

In a NoC, however, the destination of a transaction cannot signal
directly to a source that its input buffer is full. Consequently,
transactions to a destination can be started, possibly from multiple
sources, after the destination's input buffer has filled up. Two policies
can be adopted when an input buffer is full. The first is not to accept
additional incoming transactions, and to store them in the network.
However, this approach can easily lead to network congestion, as the data
could eventually be stored all the way back to the sources, blocking the
links in between. The second approach is to accept incoming transactions
at a full destination, and drop some data in the input buffer. Congestion
is avoided but data is lost, which is undesirable.

To avoid input buffer overflow, connections can be used, together with
end-to-end flow control. At connection setup between a master and one
or more slaves, buffer space is allocated at the network interfaces of the
slaves, and the network interface of the master is assigned credits
reflecting the amount of buffer space at the slaves. The master can only
send data when it has enough credits for the destination slave(s). The
slaves grant credits to the master when they consume data.
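
The credit mechanism just described can be sketched for a single master and a single slave as follows. The class and method names (MasterNI, SlaveNI, try_send, consume) are hypothetical, credits are counted in words as an assumption, and the transport of data and credits through the network is reduced to direct method calls.

```python
class SlaveNI:
    """Network interface of a slave, with buffer space allocated at setup."""

    def __init__(self, buffer_words):
        self.capacity = buffer_words
        self.used = 0

    def deliver(self, words):
        # With correct credit accounting this can never overflow.
        assert self.used + words <= self.capacity, "input buffer overflow"
        self.used += words

    def consume(self, words):
        """The slave consumes data; the freed space is returned as credits."""
        self.used -= words
        return words


class MasterNI:
    """Network interface of a master, holding credits for one slave."""

    def __init__(self, credits):
        self.credits = credits   # assigned at connection setup

    def try_send(self, slave, words):
        if words > self.credits:
            return False         # not enough credits: the data must wait
        self.credits -= words
        slave.deliver(words)
        return True


slave = SlaveNI(buffer_words=8)
master = MasterNI(credits=8)
first = master.try_send(slave, 6)        # accepted, 2 credits left
second = master.try_send(slave, 4)       # refused, only 2 credits left
master.credits += slave.consume(6)       # slave consumes, credits return
third = master.try_send(slave, 4)        # accepted now
```

Because the master never holds more credits than there is free space at the slave, no data is dropped and none has to be stored inside the network.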



As described in the previous two sections, NoCs have different
properties from both existing off-chip networks and existing on-chip
interconnects. As a result, existing protocols and service interfaces
cannot be adopted directly to NoCs, but must take the characteristics of
NoCs into account. For example, a protocol such as TCP/IP assumes the
network is lossy, and includes significant complexity to provide reliable
communication. Therefore, it is not suitable in a NoC where we assume data
transfer reliability is already solved at a lower level. On the other
hand, existing on-chip protocols such as VCI, OCP, AMBA, or CoreConnect
are also not directly applicable. For example, they assume ordered
transport of data: if two requests are initiated from the same master,
they will arrive in the same order at the destination. This does not hold
automatically for NoCs. Atomic chains of transactions and end-to-end flow
control also need special attention in a NoC interface.

Our objectives when defining our network services are the following.
First, the services abstract from the network internals as much as
possible. This is a key ingredient in tackling the challenge of decoupling
the computation from communication [14, 22], which allows IPs (the
computation part), and the interconnect (the communication part) to be
designed independently from each other. As a consequence, our services are
positioned at the transport layer in the ISO-OSI reference model [24],
which is the first layer to be independent of the implementation of the
network.

Second, we aim at a NoC interface as close as possible to a bus
interface. NoCs can then be introduced non-disruptively: with minor
changes, existing IPs, methodologies and tools can continue to be used. As
a consequence, we use a request-response interface, similar to interfaces
for split buses [1, 12, 13, 17, 25].

Third, our interface extends traditional bus interfaces to fully exploit
the power of NoCs. For example, we offer connection-based communication
which does not only relax ordering constraints (as for buses), but also
enables new communication properties, such as end-to-end flow control
based on credits, or guaranteed throughput [8, 19, 20]. All these
properties can be set for each connection individually.

A. The Æthereal Connection and Transaction Model

IPs interact with our network [8, 19, 20] at so-called network interfaces
(NIs). NIs provide NI ports (NIPs) through which the communication services
are accessed. As shown in Figure 3, an NI can have several NIPs to which
one or more IPs (computation elements or memories, but not interconnection
elements) can be connected. Similarly, an IP can be connected to more than
one NI and NIP.

Figure 3. Examples of links between NIs and IPs.



Communication between NIPs is performed on connections. Connections
are introduced to describe and identify communication with different
properties, such as guaranteed throughput, bounded latency and jitter,
ordered delivery, or flow control. For example, to distinguish and
independently guarantee communication of 1 Mb/s and 25 Mb/s, two
connections can be used. Two NIPs can be connected by multiple
connections, possibly with different properties. Connections as defined
here are similar to the concept of threads and connections from OCP and
VCI. Where in OCP and VCI connections are used only to relax transaction
ordering, we generalize from only the ordering property to include
configuration of buffering and flow control, guaranteed throughput, and
bounded latency per connection.

Æthereal connections must be created with the desired properties before
being used. This may result in resource reservations inside the network
(e.g., buffer space, or percentage of the link usage per time unit). If the
requested resources are not available, the network will refuse the request.
After usage, connections are closed, which leads to freeing the resources
occupied by that connection.

To allow more flexibility in configuring connections, and, hence, better
resource allocation per connection, the outgoing and return parts of
connections are configured separately. For example, different buffer space
can be allocated in the ANIP and PNIPs, respectively, or different
bandwidths can be reserved for requests and responses.

Depending on the requested services, the time to handle a connection
(i.e., creating, closing, modifying services) can be short (e.g.,
creating/closing an unordered, lossy, best-effort connection) or
significant (e.g., creating/closing a multicast guaranteed-throughput
connection). Consequently, connections are assumed to be created, closed,
or modified infrequently, coinciding, e.g., with reconfiguration points,
when the application requirements change.

Communication takes place on connections using transactions, consisting
of a request and possibly a response. The request encodes an operation
(e.g., read, write, flush, test and set, nop) and possibly carries outgoing
data (e.g., for write commands). The response returns data as a result of a
command (e.g., read) and/or an acknowledgment.

Figure 4. Transaction composition.

Connections involve at least two NIPs. Transactions on a connection
are always started at one and only one of the NIPs, called the connection's
active NIP (ANIP). All the other NIPs of the connection are called passive
NIPs (PNIPs).

There can be multiple transactions active on a connection at a time (as
for split buses). That is, transactions can be started at the ANIP of a
connection while responses for earlier transactions are pending. If a
connection has multiple slaves, multiple transactions can be initiated
towards different slaves. Transactions are also pipelined between a single
pair of a master and a slave for both requests and responses. In
principle, transactions can also be pipelined within a slave, if the slave
allows this.

A transaction is composed from the following messages (see Figure 4):

• A command message (CMD) is sent by the ANIP, and describes the
action to be executed at the slave connected to the PNIP. Examples
of commands are read, write, test and set, and flush. Commands
are the only messages that are compulsory in a transaction. For
NIPs that allow only a single command with no parameters (e.g.,
fixed-size address-less write), we assume the command message
still exists, even if it is implicit (i.e., not explicitly sent by the IP).

• An out data message (OUTDATA) is sent by the ANIP following a
command that requires data to be executed (e.g., write, multicast,
and test-and-set).

• A return data message (RETDATA) is sent by a PNIP as a consequence
of a transaction execution that produces data (e.g., read, and
test-and-set).

• A completion acknowledgment message (RETSTAT) is an optional
message which is returned by a PNIP when a command has been
completed. It may signal either a successful completion or an error.
For transactions including both RETDATA and RETSTAT the
two messages can be combined in a single message for efficiency.
However, conceptually, they both exist: RETSTAT to signal the
presence of data or an error, and RETDATA to carry the data. In
bus-based interfaces RETDATA and RETSTAT typically exist as two
separate signals [1, 12, 13, 17, 25].

Messages composing a transaction are divided in outgoing messages,
namely CMD and OUTDATA, and response messages, namely RETDATA and
RETSTAT. Within a transaction, CMD precedes all other messages, and
RETDATA precedes RETSTAT if present. These rules apply both between
master and ANIP, and PNIP and slave. Examples of transactions are shown
in Figure 5.
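
The composition rules above can be expressed as a small checker. This is only a sketch of the stated rules; the function name is hypothetical, and the check that a transaction contains exactly one CMD is an assumption read into "CMD precedes all other messages".

```python
OUTGOING = {"CMD", "OUTDATA"}       # sent by the ANIP
RESPONSE = {"RETDATA", "RETSTAT"}   # returned by a PNIP


def is_valid_transaction(messages):
    """Check a list of message kinds against the transaction rules."""
    # CMD is compulsory and precedes all other messages.
    if not messages or messages[0] != "CMD":
        return False
    # Assumption: a transaction carries exactly one command.
    if messages.count("CMD") != 1:
        return False
    # Only the four message kinds defined for a transaction may appear.
    if any(m not in OUTGOING | RESPONSE for m in messages):
        return False
    # RETDATA precedes RETSTAT if both are present.
    if "RETDATA" in messages and "RETSTAT" in messages:
        if messages.index("RETDATA") > messages.index("RETSTAT"):
            return False
    return True


acknowledged_write = ["CMD", "OUTDATA", "RETSTAT"]
read = ["CMD", "RETDATA", "RETSTAT"]
```

A transaction consisting of a CMD alone is valid under these rules, which matches the statement that commands are the only compulsory messages.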

We classify connections as follows (see Figure 6):

• A simple connection is a connection between one ANIP and one
PNIP.

• A narrowcast connection is a connection between one ANIP and
one or more PNIPs, in which the ANIP initiates transactions that
are executed by exactly one PNIP. An example of the narrowcast
connection is shown in Figure 7, where the ANIP performs
transactions on an address space which is mapped on two memory
modules. Depending on the transaction address, a transaction
is executed on only one of these two memories.

• A multicast connection is a connection between one ANIP and
one or more PNIPs, in which the sent messages are duplicated and
each PNIP receives a copy of those messages. In a multicast
connection no return messages are currently allowed, because of the
large traffic they generate (i.e., one response per destination). It
could also increase the complexity in the ANIP because individual
responses from PNIPs must be merged into a single response for
the ANIP. This requires buffer space and/or additional computation
for the merging itself.

Figure 7. A narrowcast connection.
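
The difference between narrowcast and multicast delivery can be illustrated with the two-memory example. The address ranges and PNIP names in ADDRESS_MAP are hypothetical; the text only states that the address space is mapped on two memory modules and that the address selects exactly one of them.

```python
# Hypothetical mapping of the ANIP's address space onto two memories.
ADDRESS_MAP = [
    (0x0000, 0x7FFF, "pnip_mem0"),
    (0x8000, 0xFFFF, "pnip_mem1"),
]


def narrowcast_target(address):
    """Return the single PNIP that executes a transaction, plus its local address."""
    for low, high, pnip in ADDRESS_MAP:
        if low <= address <= high:
            return pnip, address - low
    raise ValueError(f"address {address:#x} is not mapped on this connection")


def multicast_targets():
    """On a multicast connection every PNIP receives a copy of the message."""
    return [pnip for _, _, pnip in ADDRESS_MAP]


target = narrowcast_target(0x8010)
```

In the narrowcast case a response can be returned because only one PNIP executes the transaction; in the multicast case one response per destination would have to be merged at the ANIP.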



B. Connection Properties

In this section we describe the properties that can be configured for
a connection: guaranteed message integrity, guaranteed transaction
completion, various transaction orderings, guaranteed throughput, bounded
latency and jitter, and connection flow control.

Data Integrity. Data integrity means that the payload of the message
is not changed (accidentally or not) during transport. We assume that data
integrity is already solved at a lower layer in our network, namely at the
link layer, because in current on-chip technologies data can be transported
uncorrupted over links. Consequently, our network interface always
guarantees that messages are delivered uncorrupted at the destination.

Transaction Completion. A transaction without a response is said to
be complete when it has been executed by the slave. As there is no response
message to the master, no guarantee regarding transaction completion can
be given.

Figure 8. Message ordering is observable at a, b, c, and d.

A transaction with a response is said to be complete when a RETSTAT
message is received from the ANIP¹. The transaction may either (a) be
executed successfully, in which case a success RETSTAT is returned, (b)
fail in its execution at the slave, and then an execution error RETSTAT is
returned, or (c) fail because of buffer overflow in a connection with no
flow control, and then it reports an overflow error.

In our network, routers do not drop data [20], therefore, messages are
always guaranteed to be delivered at the NI. For connections with flow
control, NIs do not drop data either. Thus, message delivery to the IPs is
guaranteed automatically in this case.

However, if there is no flow control, messages may be dropped at the
network interface in case of buffer overflow (see the paragraph on
end-to-end flow control below). All of CMD, OUTDATA, and RETDATA may be
dropped at the NI. To guarantee transaction completion, RETSTAT is not
allowed to be dropped. Consequently, in the ANIPs enough buffer space
must be provided to accommodate RETSTAT messages for all outstanding
transactions. This is enforced by bounding the number of outstanding
transactions.
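
Bounding the number of outstanding transactions can be sketched as a simple counter at the ANIP. The class name AnipIssuer and the choice to free a slot when the master consumes the RETSTAT are assumptions made for this illustration.

```python
class AnipIssuer:
    """Limits outstanding transactions to the RETSTAT buffer space in the ANIP."""

    def __init__(self, retstat_slots):
        self.retstat_slots = retstat_slots   # RETSTAT messages that fit
        self.outstanding = 0                 # issued, RETSTAT not yet consumed

    def issue(self):
        """Try to start a transaction; refuse if its RETSTAT might not fit."""
        if self.outstanding >= self.retstat_slots:
            return False
        self.outstanding += 1
        return True

    def retstat_consumed(self):
        """The master has read a RETSTAT, so its buffer slot is free again."""
        self.outstanding -= 1


anip = AnipIssuer(retstat_slots=2)
results = [anip.issue(), anip.issue(), anip.issue()]   # third must wait
anip.retstat_consumed()
retry = anip.issue()
```

With this bound every outstanding transaction has a reserved place for its RETSTAT, so a RETSTAT never needs to be dropped even on a connection without flow control.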



Transaction Ordering. In this section, we describe the ordering
requirements between different transactions within a single connection.
Over different connections, no ordering of transactions is defined at the
transport layer.



¹ We assume that when data is received as a response (RETDATA), a RETSTAT
(possibly implicit) is also received to validate the data.






There are several points in a connection where the order of transactions
can be observed (see Figure 8): (a) the order in which the master presents
CMD messages to the ANIP, (b) the order in which the CMDs are delivered to
the slave by the PNIP, (c) the order in which the slave presents the
responses to the PNIP, and (d) the order the responses are delivered to the
master by the ANIP. Note that not all of (b), (c), and (d) are always
present. Moreover, there are no assumptions about the order in which the
slaves execute transactions; we can only observe the order of the
responses. We consider the order of the transaction execution to be a
system decision, and not a part of the interconnect protocol.

At both ANIP and PNIPs, outgoing messages belonging to different
transactions on the same connection are allowed to be interleaved. For
example, two write commands can be issued, and only afterwards their
data follows. If the order of OUTDATA messages differs from the order
of CMD messages, transaction identifiers must be introduced to associate
OUTDATAs with their corresponding CMD.

Outgoing messages can be delivered by the PNIPs to the slaves (see
Figure 8-b) as follows:

• Unordered, which imposes no order on the delivery of the outgoing
messages of different transactions at the PNIPs.

• Ordered locally, where transactions must be delivered to each
PNIP in the order they were sent, but no order is imposed across
PNIPs. Locally-ordered delivery of the outgoing messages can be
provided either by an ordered data transportation, or by reordering
of outgoing messages at the PNIP.

• Ordered globally, where transactions must be delivered in the order
they were sent, across all PNIPs of the connection. Globally-ordered
delivery of the outgoing part of transactions requires a
costly synchronization mechanism.

Transaction response messages can be delivered by the slaves to the
PNIPs (see Figure 8-c) as follows:

• Ordered, when RETDATA and RETSTAT messages are returned in
the same order as the CMDs were delivered to the slave.

• Unordered, otherwise.

Unordered responses must be matched to their transactions, using
tags attached to messages for transaction identification (similar to tags
in VCI).

Response messages can be delivered by the ANIP to the master (see
Figure 8-d) as follows:

• Unordered, which imposes no order on the delivery of responses.
Here, also, tags must be used to associate responses with their
corresponding CMDs.

• Ordered locally, where RETDATA and RETSTAT messages of
transactions for a single slave are delivered in the order the original
CMDs were presented by the master to the ANIP. Note that there is
no ordering imposed for transactions to different slaves within the
same connection.

• Globally ordered, where all responses in a connection are delivered
to the master in the same order as the original CMDs. When
transactions are pipelined on a connection, then globally-ordered
delivery of responses requires reordering at the ANIP.

All 3 x 2 x 3 = 18 combinations between the above orderings are
possible. Out of these, we define and offer the following two. An unordered
connection is a connection in which no ordering is assumed in any part
of the transactions. As a result, the responses must be tagged to be able
to identify to which transaction they belong. Implementing unordered
connections has low cost; however, they may be harder to use, and introduce
the overhead of tagging.

An ordered connection is defined as a connection with local ordering
for the outgoing messages from PNIPs to slaves (Figure 8-b), ordered
responses at the PNIPs (Figure 8-c), and global ordering for responses at
the ANIP (Figure 8-d). We choose local ordering for the outgoing part
because the global ordering has too high a cost, and has few uses. The
ordering of responses is selected to allow a simple programming model with
no tagging. Global ordering at the ANIP is possible at a moderate cost,
because all the ordering is done locally in the ANIP.
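
The count of 18 combinations, and the position of the two offered connection types among them, can be checked with a short enumeration. The tuple encoding and the option labels are chosen for this example only.

```python
from itertools import product

# Ordering options at the three observation points of Figure 8.
OUTGOING_AT_PNIP = ("unordered", "local", "global")   # point (b)
RESPONSE_AT_PNIP = ("ordered", "unordered")           # point (c)
RESPONSE_AT_ANIP = ("unordered", "local", "global")   # point (d)

ALL_COMBINATIONS = list(
    product(OUTGOING_AT_PNIP, RESPONSE_AT_PNIP, RESPONSE_AT_ANIP)
)

# The two connection types that are defined and offered.
UNORDERED_CONNECTION = ("unordered", "unordered", "unordered")
ORDERED_CONNECTION = ("local", "ordered", "global")

count = len(ALL_COMBINATIONS)
```

Offering only two of the 18 combinations keeps the interface small while covering the low-cost case and the case with a simple, tag-free programming model.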

A user can emulate connections with global ordering at the PNIPs using
non-pipelined acknowledged transactions.

Connection latency, throughput, and jitter. In our network,
throughput can be reserved for connections in a time-division multiple
access (TDMA) fashion, where bandwidth is split in fixed-size slots on a
fixed time frame. Bandwidth, as well as bounds on latency and jitter can
be guaranteed when slots are reserved. They are all defined in multiples of
the slots.
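
How guarantees follow from slot reservations can be sketched with a slot table for one link. The table contents, the connection names "A" and "B", the link bandwidth of 800 units, and the use of the largest distance between reserved slots as a waiting bound are illustrative assumptions, not the exact formulas of the network described here.

```python
def guaranteed_bandwidth(slot_table, conn, link_bandwidth):
    """Bandwidth reserved for conn: its share of the slots times the link rate."""
    return link_bandwidth * slot_table.count(conn) / len(slot_table)


def worst_case_wait(slot_table, conn):
    """Largest distance, in slots, between consecutive slots reserved for conn."""
    n = len(slot_table)
    owned = [i for i, owner in enumerate(slot_table) if owner == conn]
    if not owned:
        raise ValueError(f"no slots reserved for {conn!r}")
    gaps = []
    for k, slot in enumerate(owned):
        nxt = owned[(k + 1) % len(owned)]
        gap = (nxt - slot) % n
        gaps.append(gap if gap else n)   # a single slot recurs once per frame
    return max(gaps)


# One time frame of 8 slots; None marks slots left to best-effort traffic.
table = ["A", None, "B", "A", None, None, "A", None]
bw_a = guaranteed_bandwidth(table, "A", link_bandwidth=800)
wait_a = worst_case_wait(table, "A")
wait_b = worst_case_wait(table, "B")
```

Both quantities come out as multiples of a slot, which is why spreading a connection's slots evenly over the frame lowers its waiting bound without changing its bandwidth.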

Guaranteed-throughput connections can overbook resources in some
cases. For example, when an ANIP opens a guaranteed-throughput read
connection, it must reserve slots for the read command messages, and for
the read data messages. The ratio between the two can be very large (e.g.,
1:100), which leads either to a large number of slots, or bandwidth being
wasted for the read command messages.

To solve this problem, we allow the request and response parts of a
connection to be configured independently for all of throughput, latency
and jitter. Consequently, the request part of a connection can be best
effort, while the response can have guaranteed throughput (or vice versa).
For the example mentioned above, we can use best-effort read messages, and
guaranteed-throughput read-data messages. No global connection guarantees
can be offered in this case, but the overall throughput can be higher
and more stable than in the case of using only best-effort traffic.
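
The overbooking problem can be made concrete with a small calculation. The sketch below assumes, purely for illustration, that read data needs half of the link and read commands need 1/200 of it; the function names and the frame sizes are hypothetical. Because reservations are whole slots, a small frame wastes most of the command slot, while a waste-free reservation needs a frame of 200 slots.

```python
import math
from fractions import Fraction


def slots_for(bandwidth_fraction, frame_slots):
    """Smallest whole number of slots covering a fraction of the link."""
    return math.ceil(bandwidth_fraction * frame_slots)


def overbooked_fraction(bandwidth_fraction, frame_slots):
    """Link fraction reserved beyond what the traffic actually needs."""
    reserved = Fraction(slots_for(bandwidth_fraction, frame_slots), frame_slots)
    return reserved - bandwidth_fraction


command_need = Fraction(1, 200)   # assumed share needed by read commands

# Small frame: one whole slot must be reserved for a tiny command stream.
print(slots_for(command_need, 16), overbooked_fraction(command_need, 16))

# Large frame: no waste, but the slot table has 200 entries.
print(slots_for(command_need, 200), overbooked_fraction(command_need, 200))
```

Sending the commands as best-effort traffic avoids both outcomes, at the price of losing the guarantee on the request part.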

Connection flow control. As mentioned earlier, our network guarantees
that messages are delivered to the NI. Messages sent from one of the
NIPs are not immediately visible at the other NIP, because of the
multi-hop nature of networks. Consequently, handshakes over a network
would allow only a single message to be transmitted at a time. This limits
the throughput on a connection and adds latency to transactions. To solve
this problem, and achieve a better network utilization, the messages must
be pipelined. In this case, if the data is not consumed at the PNIP at the
same rate it arrives, either flow control must be introduced to slow down
the producer, or data may be lost because of limited buffer space at the
consumer NI.

We introduce end-to-end flow control at the level of connections, which
requires buffer space to be associated with connections. End-to-end flow
control ensures that messages are sent over the network only when there is
enough space in the NIP's destination buffer to accommodate them.

End-to-end flow control is optional (i.e., to be requested when the
connections are opened) and can be configured independently for the
outgoing and return paths. When no flow control is provided, messages are
dropped when buffers overflow. Multiple policies of dropping messages are
possible, as in off-chip networks. Possible scenarios include: (a) the
oldest message is dropped (milk policy), or (b) the newest message is
dropped (wine policy) [24].
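
The two dropping policies can be sketched as follows. This is an illustrative model only: the function `enqueue`, the policy strings, and the interpretation of "newest message" as the arriving one are assumptions, not part of the described network.

```python
from collections import deque


def enqueue(buffer, message, capacity, policy):
    """Add a message to a bounded buffer; return the dropped message, if any."""
    if len(buffer) < capacity:
        buffer.append(message)
        return None
    if policy == "milk":
        # Milk policy: the oldest message is dropped to make room.
        dropped = buffer.popleft()
        buffer.append(message)
        return dropped
    if policy == "wine":
        # Wine policy: the newest (arriving) message is dropped.
        return message
    raise ValueError("unknown policy: " + policy)


milk, wine = deque(), deque()
for msg in (1, 2, 3):
    enqueue(milk, msg, capacity=2, policy="milk")
    enqueue(wine, msg, capacity=2, policy="wine")
print(list(milk), list(wine))
```

After three messages arrive at a two-entry buffer, the milk buffer holds the two most recent messages and the wine buffer holds the two oldest.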

We opt for credit-based flow control. Credits are associated with the
empty buffer space at the receiver NI. The sender's credit is lowered as
data is sent. When data is delivered at the receiver NIP, credits are
granted to the sender. If the sender's credit is not sufficient to send
some data, the NI at the sender stalls the sending.
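
A minimal sketch of the credit mechanism, under simplifying assumptions: the class `CreditedConnection` and its methods are hypothetical, one credit corresponds to one buffer word, and credits return to the sender instantly, whereas in a real network they travel back over the return path.

```python
from collections import deque


class CreditedConnection:
    """Illustrative one-way connection with credit-based flow control."""

    def __init__(self, buffer_words):
        # The sender starts with one credit per empty receiver buffer word.
        self.credits = buffer_words
        self.receiver_buffer = deque()

    def send(self, word):
        """Return True if the word was sent, False if the sender stalls."""
        if self.credits == 0:
            return False
        self.credits -= 1                  # credit is lowered as data is sent
        self.receiver_buffer.append(word)
        return True

    def consume(self):
        """Deliver one word at the receiver and grant a credit back."""
        word = self.receiver_buffer.popleft()
        self.credits += 1
        return word


conn = CreditedConnection(buffer_words=2)
results = [conn.send("a"), conn.send("b"), conn.send("c")]
delivered = conn.consume()
retry = conn.send("c")
print(results, delivered, retry)
```

The third send stalls because both buffer words are occupied; once the receiver consumes a word, the returned credit lets the stalled send proceed, so no message is ever dropped.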

C. Use Cases

To illustrate the need for differentiated services on connections, we
show in this section some examples of traffic. We describe the properties
they would use over an Æthereal connection to meet their traffic
requirements.

Video processing streams typically require a lossless, in-order video
stream with guaranteed throughput, but possibly allow corrupted samples.
An Æthereal connection for such a stream would require the necessary
throughput, ordered transactions, and flow control. If the video stream is
produced by the master, only write transactions are necessary. In such a
case, with a flow-controlled connection there is no need to also require
transaction completion, because messages are never dropped, and the write
command and its data are always delivered at the destination. Data
integrity is always provided by our network, even though it may not be
necessary in this case.

Another example is that of cache updates, which require uncorrupted,
lossless, low-latency data transfer, but for which ordering and guaranteed
throughput are less important. In such a case, a connection would not
require any time-related guarantees, because a low latency, even if
preferable, is not critical. Low latency can be obtained even with a
best-effort connection. The connection would also require flow control and
guaranteed transaction completion to ensure lossless transactions.
However, no ordering is necessary, because this is not important for cache
updates, and allowing out-of-order transactions can reduce the response
time.
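
The two use cases above can be summarized as connection configurations. The record below is a hypothetical illustration, not the Æthereal API: the type `ConnectionConfig` and its field names are invented here, and only the property values are taken from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConnectionConfig:
    """Hypothetical per-connection properties (names are illustrative)."""
    guaranteed_throughput: bool
    ordered: bool
    flow_control: bool
    transaction_completion: bool


# Video stream: throughput, ordering and flow control; no completion needed.
VIDEO_STREAM = ConnectionConfig(
    guaranteed_throughput=True,
    ordered=True,
    flow_control=True,
    transaction_completion=False,
)

# Cache updates: best effort, unordered, but lossless with completion.
CACHE_UPDATE = ConnectionConfig(
    guaranteed_throughput=False,
    ordered=False,
    flow_control=True,
    transaction_completion=True,
)


def is_lossless(cfg):
    """With flow control, messages are never dropped on the connection."""
    return cfg.flow_control


print(is_lossless(VIDEO_STREAM), is_lossless(CACHE_UPDATE))
```

Data integrity is not a field because, as stated above, the network always provides it.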



V. Conclusions 

In this paper, we compare networks on chip (NoC) to off-chip networks
(e.g., computer networks) and existing on-chip interconnects (e.g.,
busses). We show that NoCs have many similarities with off-chip networks.
However, they also differ, especially in their resource constraints. For
example, on a chip, memory and computation resources are more expensive,
while there are more wires. This makes NoC architectures different from
off-chip networks, and requires rethinking of network services.

We also compare NoCs to existing on-chip interconnects, such as buses
and switches. By directly connecting IP blocks, existing on-chip
interconnects can offer tight coupling between masters and slaves, and
global arbitration. In NoCs, masters and slaves are completely decoupled,
and the arbitration is distributed over the network nodes. This makes it
harder to provide guarantees, such as bandwidth lower bounds, and
transaction ordering.

We define a set of NoC services that abstract from the network details.
Using these services in the IP design decouples computation and
communication. We use a request-response transaction model to be close to
existing on-chip interconnect protocols. This eases the migration of
current IPs to NoCs. To fully utilize the NoC capabilities, such as high
bandwidth and transaction concurrency, our services provide
connection-oriented communication. Connections can be configured
independently with different properties. These properties include
transaction completion, various transaction orderings, bandwidth lower
bounds, latency and jitter upper bounds, and flow control.

Our services are a prerequisite for service-based system design, which
makes applications independent of NoC implementations, makes designs
more robust, and enables architecture-independent quality-of-service

Abstract

Continuing VLSI technology scaling raises several deep submicron
(DSM) problems like relatively slow interconnect, power dissipation and
distribution, and signal integrity. Those problems are encountered
particularly on long wires for global interconnect. As clock frequencies
increase, scaled wires become relatively slower, and on-chip
communication will be the limiting performance factor of future chips. We
explain why efficient sharing of the wires for long-distance
communication is the solution to this problem. We introduce networks on
silicon (NoS), which route packets over shared (semi-)global wires. NoS
performance is expected to be high, but comes at a cost. Balancing the
performance and cost of a NoS is a major challenge, and we believe busses
still have a role to play.



1 Technology trend 

VLSI technology scaling has long followed Moore's law. No fundamental
barriers have been identified that invalidate this law for at least
another decade [12]. Moore's law predicts that chips in 2010 will count
over 4 billion transistors, operating in the multi-GHz range. This
abundance of transistors will make very complex systems on silicon (SoS)
possible.

However, challenges at all abstraction levels of design will have to be
addressed before such SoSs become a reality. The three most important
deep submicron (DSM) challenges, related to all abstraction levels, are:
substantial wire delay, controlling power delivery and dissipation, and
assuring signal integrity.

Until recently, on-chip wiring was cheap. Consequently, architectural
models have been employed that relied on low-latency communication to
globally share expensive computational resources. Global wire delay stays
at best constant under technology scaling and hence these wires become
effectively slower compared to a gate delay. For example, for 130 nm
technology the reachable distance of a repeated global signal in a clock
cycle is no more than the length of a chip [4]. For 50 nm technology,
crossing a chip with highly optimized interconnect takes between six and
ten clock cycles, clearly invalidating the low-latency assumption of
today. Hence we must move to system-level architectures that scale with
technology.

Figure 1. The number of 50k blocks for future process technologies.

A feasible template for a future-proof architecture is constructed
from processing nodes that do not grow in complexity with technology.
Instead, as technology scales, the number of these processing nodes on
the chip grows. An on-chip communication network then combines these
nodes into a SoS [4].

Various publications show that the spanning wires in blocks of 50k
gates scale with technology [4, 13]. This means that the aforementioned
DSM issues can be handled by CAD tools, assuming their evolutionary
improvement. Figure 1 shows the exponentially increasing number of such
50k blocks for a large die in subsequent technologies; in 35 nm this
number is approximately ten thousand (adapted from [13] and [4]). It
remains to find a communication architecture that allows a SoS composed
of these blocks to cooperate efficiently.

2 Networks on silicon are inevitable 

Given the growing demand for and impact of interconnect on system cost
and performance, it is worthwhile to optimize the utilization of wires.
Ad-hoc global wiring structures often lead to a huge number of wires with
an average usage as low as 10% in time [2]. To control cost in this
scenario, the wire packing density must be very high, which is not
beneficial for the power and delay characteristics. Efficient mechanisms
for sharing (semi-)global wires must solve this cost-performance dilemma.

In deep submicron technologies, (semi-)global wires need special
attention for power, signal-integrity, and performance reasons. In the
discussion below we show how special circuit techniques can handle these
issues. Such techniques only work, however, when embedded in dedicated
communication IP, which provides a more abstract interface.

Power is an issue for global interconnect because it costs more energy
to send a bit of information over longer wires. Reducing the
communication delay increases the energy consumption because it requires
bigger drivers. Employing low-swing signaling for the global wires saves
up to a factor of four in power for these wires [15]. Implementing
low-swing signaling requires special circuit techniques.



[...] capacitive and inductive coupling between wires. Capacitive
noise coupling is the result of the large aspect ratio of wires in DSM
technologies. Inductive noise coupling becomes more of a problem due to
the decreasing transition times. IR drop in the supply distribution
increasingly contributes to the noise. The most effective way to make a
connection robust [...]

Differential signaling improves both the generation of and
sensitivity to noise.

The signal propagation delay of an uninterrupted wire grows
quadratically with its length; hence from a certain length onwards it is
advantageous to partition the wire into segments with repeaters in
between. The repeater insertion technique improves bandwidth and latency
but at the cost of higher power consumption. Wire delay can be reduced by
fat wires with a lower resistance per unit length, at the cost of lower
wire density. Such wires behave like lossy transmission lines and require
drivers with a resistance matched to the transmission line.
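
The benefit of repeater insertion follows directly from the quadratic growth of delay with length. The sketch below uses a deliberately simple first-order model with hypothetical constants (`k` and `repeater_delay` are placeholders, not technology data): each segment's delay grows with the square of its length, and every repeater adds a fixed delay.

```python
def wire_delay(length_mm, segments, k=0.1, repeater_delay=0.05):
    """Delay (ns) of a wire split into equal segments.

    k (ns/mm^2) and repeater_delay (ns) are hypothetical constants.
    """
    per_segment = k * (length_mm / segments) ** 2   # quadratic in length
    return segments * per_segment + (segments - 1) * repeater_delay


def best_segment_count(length_mm, max_segments=64):
    """Number of segments that minimizes delay in this simple model."""
    return min(range(1, max_segments + 1),
               key=lambda n: wire_delay(length_mm, n))


n = best_segment_count(10.0)
print(n, wire_delay(10.0, 1), wire_delay(10.0, n))
```

Doubling the length of an unrepeated wire quadruples its delay, whereas a long wire with a suitable number of repeaters is much faster than the same wire left uninterrupted. The model ignores the power cost of the repeaters, which the text identifies as the price of this technique.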

As a result, we believe that all inter-block communication will be
implemented by hard-macro transmitters and receivers, employing low-swing
differential signaling, with well-controlled interconnect instead of
ad-hoc drivers handled by standard place-and-route tools. In this way,
communication links can be realized with predictable performance and DSM
robustness.

Currently, the prevalent on-chip interconnects are busses [1]. In a
bus architecture, devices share a single transmission medium to
communicate. At a given time, only one device has access to the shared
medium. An arbitration mechanism is required to order simultaneous
accesses. Such functionality is typically performed by a centralized bus
arbiter. The performance of a shared-medium bus scales badly. For an
increasing number of bus clients (i) individual clients get less
bandwidth on average, and (ii) [...] bandwidth.

Figure 2. Structural view of a network on silicon consisting of
processing nodes (P) and nodes supporting communication (R, B).

A solution that pairs scalable communication performance and minimal
interconnect cost is expected from networks on silicon (NoS), where the
SoS is considered as a network of components [8, 3, 1]. Figure 2
illustrates the hardware architecture of this concept. The outer
components (marked P) exclusively perform processing and storage
functions, whereas the inner components (marked B and R) form the NoS and
cater to the communication needs of the outer components. The basic
building blocks of a NoS are routers (R).

A router forwards data from its input ports to its output ports in a
concurrent fashion. To that end, a router of arity N contains an N x N
switch matrix. Data packets make their way through the network based on
the routing information in their headers. A link between two routers is
implemented by a point-to-point connection. The links typically span
medium to long distances, ranging from several to more than twenty
millimeters. The actual length depends on the chosen topology of the
network. For a mesh topology the links are relatively short; for a torus,
which is a mesh with wrap-around connections, some links have a length of
half the edge of the chip. Links can be optimized for bandwidth, latency,
power, or a combination of these, depending on performance requirements.

3 NoS requirements

An important characteristic of a future system-level architecture is
the separation between computation and communication. A NoS allows the
computational blocks to communicate with one another via a uniform
interface. A uniform interface is advantageous because (i) it frees the
core developers from having to make assumptions about the system in which
the core will be used, and (ii) it does not constrain the development of
newer communication architectures by detailed interfacing requirements of
particular legacy SoC components [6]. Several on-chip bus standards are
evolving to realize this goal, most notably VCI, put forward by VSIA [14],
and more recently, the Open Core Protocol [10].

The fundamental aim of a NoS is to provide flexible and efficient
communication between the thousands of IP blocks in a system, with
performance guarantees. In a typical SoS, the communication demands of
different IP blocks show large variations. For example, data rates may be
constant (e.g. digital video) or variable (e.g. compressed video). The
importance of latency and jitter also varies greatly. Finally, the data
granularity may range from single words to large blocks. A NoS should be
able to offer different services to different clients. Each service class
must be implemented efficiently, using a shared uniform infrastructure.

A high utilization of the network comes at a price. When the network
starts to saturate, throughput and latency will show huge variations,
which is not acceptable in real-time applications. Hence, the network
should also provide guarantees, like lossless data transport, minimal
bandwidth, and bounded latency. The way packets are buffered and
scheduled in routers, and the effects on performance guarantees, has been
the subject of intense research. Fundamentally, sharing and guarantees
are conflicting, and efficiently combining guaranteed traffic with
best-effort traffic is hard [11]. Although best-effort services are
cheaper than guaranteed services, we believe that the latter are
essential because they enable compositional and scalable integration of
the IP blocks [5]. It is up to the IP integrator at design time, and up to
the application at run time, to make a trade off.



4 Performance and cost analysis of NoSs 



The vision of previous sections is that the design of future SoSs will
allow IP blocks to be plugged in at will to minimize communication costs,
but without today's problems like timing closure. In this section we
investigate the cost implications of system design based on a NoS. We
hope that the overall cost of a NoS, including the full protocol stack to
use it, turns out to be acceptable, such that the integration blessings
of NoSs do not change into a cost nightmare.

4.1 Performance

The aggregate bandwidth of a router is the product of the bandwidth
per port, $BW_{port}$, the arity of the router (number of ports), $N$,
and a utilisation factor, $\alpha \le 1$, corresponding to the router
arbitration scheme:

$$BW_{router} = \alpha \, N \, BW_{port} \qquad (1)$$

We discuss each in turn. The bandwidth per port is determined by the
bandwidth of the link and the router data path. In short:

$$BW_{port} = B \times \min(BW_{wire}, BW_{data\,path}) \qquad (2)$$

where $B$ is the width of the data path. The combined bandwidth of the
$B$ wires of a link is a function of the layout characteristics (e.g.
total length), chosen signaling technique, and the budgets for power,
delay, and area. A first-order expression for the bandwidth of a repeated
global wire optimized for power-delay is

[...]

where FO4 is the delay of an inverter driving four equally sized
inverters [4]. In a 100 nm technology, this yields 5 Gb/s per wire under
worst-case environmental conditions. Notice that the bandwidth of
repeated global wires scales with technology because such wires allow
(wave) pipelining at the segments.

Running the router data path at 5 GHz is not feasible. An aggressive
but realistic frequency is 1.25 GHz, corresponding to the clock frequency
of 50k gates blocks [4]. The critical function in the data path is the
N x N switch. For N up to 20 it meets the 1.25 GHz data rate, using N
1-out-of-N multiplexors. The relaxed demand on the wires of the link can
be used to reduce power dissipation and area.

The utilisation factor, α, reflects the effectiveness of the router in
resolving contention on the links. The queuing strategy, the queue sizes,
and the schedule algorithm all strongly influence α. Accordingly, many
queuing policies and scheduling algorithms have been presented in the
literature. For example, α = 0.59 for infinite FIFO input queues with
uniform and independent traffic. (Virtual) output queuing gives α = 1
under the same conditions, but at the cost of larger queues and a more
complex scheduling algorithm [9]. Static scheduling techniques like
(time-division-multiplexed) circuit switching can also improve the
utilization factor.

Hence, in 100 nm technology, the bandwidth of a 32 bit router port is
approximately 5 GByte/s.
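
The 5 GByte/s figure can be reproduced with a few lines of arithmetic. The sketch below encodes the relations stated in this section, with one labeled assumption: the port is taken to be limited by the slower of the per-wire link rate and the data path rate. The function names and the arity of 5 in the second call are hypothetical.

```python
def port_bandwidth(width_bits, wire_rate, datapath_rate):
    """Port bandwidth in bit/s, limited by the slower of link and data path."""
    return width_bits * min(wire_rate, datapath_rate)


def router_bandwidth(alpha, arity, bw_port):
    """Aggregate router bandwidth: utilisation x arity x port bandwidth."""
    return alpha * arity * bw_port


# 32-bit port, 5 Gb/s per wire, 1.25 GHz data path (values from the text).
bw_port = port_bandwidth(32, 5e9, 1.25e9)
print(bw_port / 8 / 1e9)                          # GByte/s per port

# Hypothetical arity-5 router with FIFO input queuing (alpha = 0.59).
print(router_bandwidth(0.59, 5, bw_port) / 8 / 1e9)
```

The data path, not the wire, is the limiting factor here: 32 bits at 1.25 GHz give 40 Gb/s, or 5 GByte/s per port.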

4.2 Cost

Three main components contribute to the area cost of a router: the
switch, the control logic, and the packet queues. The switch allows N
simultaneous connections from the N inputs to the N outputs, which
results in arrays of N x N wires, giving rise to an O(N^2) area cost.

The control logic of a router is made up of the switch-matrix schedule
unit and other configuration logic. The delay of a schedule cycle varies
greatly per algorithm (for example, for virtual output queuing from O(1)
to O(N^2.5)). It is important for two reasons. First, it determines the
lower bound for the latency that a flit² incurs to traverse the router.
Second, it affects the size of the queues. The longer a schedule cycle,
the more data arrive, given a fixed bandwidth of a port BW_port. This
leads to deeper queues, and higher area cost.

The three aforementioned queuing strategies require queues of size
O(N) to O(N^2) flits. Scheduling algorithms perform better with deeper
queues, with a decreasing return.
Besides routers, a significant amount of area is consumed by so-called
network interface (NI) modules. These modules translate the IP
transactions for a given connection to packets that are sent over the
network and vice versa. Packets can be sent once the payload has been
completely accepted by the NI. Hence, the buffers must be dimensioned
such that at least a complete packet for every simultaneously active
connection can be stored.

The trade off between utilization α and the cost is a complex one, but
of importance to the viability of NoSs.

5 The future role of busses

In Sections 1 and 2 we have argued that NoSs are essential to solve
SoS integration in a scalable fashion. While Section 4.2 raised some
general cost issues, we will now more concretely consider the trade off
between busses and NoSs. Will packet-switched NoSs completely replace
current busses in future SoSs, or will a hybrid approach emerge? We
believe that shared busses may have a role to play in first-level
communication (B in Figure 2) for the following reasons.

First, typical IP blocks underutilize the bandwidth capacity of an
individual router port. All router ports offer the same bandwidth that is
inherent to the architecture, whereas the bandwidth requirements of IP
blocks vary greatly. A shared memory module typically needs much higher
(peak) bandwidth than a streaming peripheral device. Single word
transfers, variable bit rates, bursty IO, and much lower clock rates for
IP blocks than for the NoS further waste bandwidth. This means that the
communication needs of a number of IP blocks can be aggregated using a
bus before the capacity of a network link is reached.

Second, network interfaces are more expensive (in terms of area) than
a bus adaptor. Using a bus as a first-level traffic concentrator, trading
bus adaptors for network interfaces, thus reduces the overall cost of
IP-NoS interfacing. We expect that the overhead of a bus and its network
interface is outweighed.

² Flit stands for flow control digit, an atomic portion of data handled
per schedule cycle. A packet is decomposed into flits.

Figure 3. A shared-medium bus seems a cost-effective way to connect the
IP to the packet-switched network.

Finally, the number of routers is reduced significantly when busses
are used as the first-level interconnect. Routers are larger than busses
due to their packet queues and more complex scheduling. We give an
example below.

An example of the heterogeneous communication architecture is depicted
in Figure 3. A router of arity three surrounded by twelve IP blocks is
shown. Two shared-medium busses, each connected to six 50k gates IP
blocks, communicate with the router via two network interfaces. These
have two functions: first they schedule the transactions on the bus, and
second they give the bus clients access to the packet-switched network.
The third port of the router provides communication to the remainder of
the network. Figure 4 shows an architecture using only routers. Now three
routers of arity five and one of arity four are needed.
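
To get a rough feeling for the difference between the two architectures, the sketch below compares only the switch area, using the O(N^2) growth with arity stated in Section 4.2. It is an order-of-magnitude illustration in arbitrary units: the function `switch_cost` is hypothetical, and the busses, network interfaces, queues, and control logic are all ignored.

```python
def switch_cost(arities):
    """Relative switch area: the sum of N^2 over all routers (arbitrary units)."""
    return sum(n * n for n in arities)


hybrid = [3]                 # Figure 3: one router of arity three
homogeneous = [5, 5, 5, 4]   # Figure 4: three of arity five, one of arity four

print(switch_cost(hybrid), switch_cost(homogeneous))
```

Under this simplified measure the router-only network needs roughly ten times the switch area of the hybrid one for the same twelve IP blocks; whether the overall cost follows the same trend depends on the bus and network interface overheads discussed above.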

The suggested shared-medium bus has a length of 35kλ, where λ is half
of the length of a minimal transistor. Global wires of this length will
not be the bottleneck of bus performance.³

The feasibility of hybrid NoSs hinges on the right implementation of
the busses. First, they must be shared wires, as opposed to switches.
Second, their arbitration must be combined with, or at least compatible
with, the scheduling taking place in the network interfaces, to offer
uniform end-to-end network services.

We see a future for hybrid NoSs, with first-level communication over a
shared-medium bus, and the higher levels using a packet-switched network.
Perhaps a packet-switched network can be seen as a distributed and
scalable implementation of a logical bridge that connects all the local
busses of the SoS. Deciding how many IP blocks can use a local bus before
connecting to the router network is a question that must be answered
foremost.

³ Minimum-delay wire segments have a length of [...]kλ; wire segments
optimized for power-delay product have a length of 48kλ. These lengths
scale with technology like the edge of 50k blocks [4].

Figure 4. IP to IP communication based on a homogeneous router network.

6 Conclusion 

We have argued in Section 1 that future systems on silicon (SoS) will
be composed of large numbers of processing nodes (or IP blocks). Each
processing node is relatively small (50k gates) to scale with technology,
and can be handled by CAD tools, assuming their evolutionary improvement.
The interconnect and communication between these blocks then becomes an
essential function in itself (Section 2), leading to networks on silicon
(NoS). A NoS is based on packet switching to flexibly share link capacity
between the network clients, and to provide pluriform communication
services over a uniform infrastructure. Both efficiency, provided by
best-effort traffic, and predictable performance, such as guaranteed
throughput and latency, are important (Section 3). Efficiently combining
them is a challenge. Section 4 showed that the performance of a NoS
depends on many factors, but is expected to be high. The cost of a NoS
can be stated in terms of area (routers, network interfaces), utilization
of wires, and speed (latency). They can be traded off against one
another, but also, perhaps more interestingly, against the cost of
busses. A hybrid NoS using shared-wire busses to communicate locally, and
accumulating traffic for a core router network, is a promising
architecture that deserves to be investigated.

ABSTRACT

Managing the complexity of designing chips containing billions of
transistors requires decoupling computation from communication. For the
communication, scalable and compositional interconnects (such as networks
on chip (NoC)) must be used. In this paper we show that guaranteed
services are essential in achieving this decoupling. Guarantees typically
come at the cost of inefficient resource utilization. To achieve
efficiency, they must be used in combination with best-effort services.
We describe a NoC architecture that efficiently combines guaranteed and
best-effort services. The key element of our NoC is a router consisting
conceptually of two parts: the so-called guaranteed throughput (GT) and
best-effort (BE) routers. Both offer data integrity, lossless and
in-order data delivery. Additionally, the GT router offers guaranteed
throughput and latency services. We combine the GT and BE router
architectures efficiently by sharing router resources, enabling high link
utilization. The guarantees are never affected by the BE traffic, and
links are efficiently utilized because BE traffic uses all bandwidth left
over from GT traffic. Connections are programmed using BE packets. The
programming model is robust, concurrent, and distributed. It enables
run-time and compile-time, deterministic and adaptive connection
management. For all our architectural choices, we show the trade offs
between hardware complexity and efficiency, and motivate our choices.

1. INTRODUCTION 

Recent advances in technology raise the challenge of managing the
complexity of designing chips containing billions of transistors. A key
ingredient in tackling this challenge is decoupling the computation from
communication [9, 15]. This decoupling allows IPs (the computation part)
and the interconnect (the communication part) to be designed
independently from each other.

In this paper, we focus on the communication part. Existing
interconnects (e.g., busses) may no longer be feasible for chips with
many IPs, because of the diverse and dynamic communication requirements.
Networks on a Chip (NoC) are emerging as an alternative to existing
on-chip interconnects because they (a) structure and manage global wires
in new deep-submicron technologies [2, 3, 4, 6], (b) share wires,
lowering their number and increasing their utilization [4, 6], (c) can be
energy efficient and reliable [2], and (d) are scalable when compared to
traditional busses [7].

Decoupling the computation from communication requires that the
services that IPs use to communicate (a) are well-defined, and (b) hide
the implementation details of the interconnect [9], see Figure 1(a). NoCs
again help, because they are traditionally designed using layered
protocol stacks [14], where each layer provides a well-defined interface
which decouples service usage from service implementation [15, 3].

In particular, guaranteed services are essential because they make the
requirements on the NoC explicit, thus limiting the possible interactions
(a stricter contract) of IPs with the communication environment. As a
result, IP design is simpler. IPs can also be designed independently,
because their guaranteed services are not affected by the interconnect or
by other IPs. This is essential for a compositional construction (design
and programming) of systems on chip. Moreover, for guaranteed services,
failures are restricted to the IP configuration phase (a service request
is either granted or denied by the NoC), which simplifies the IP
programming model [6]. We view the guaranteed services to be offered by
an interconnect as a requirement from the applications, see Figure 1(b).

The drawback of using guaranteed services is that they require
resource reservation for worst-case scenarios. As a consequence,
resources may not be efficiently utilized, which may not be acceptable in
a system on a chip where cost constraints are typically very tight, see
Figure 1(c). To overcome this problem, best-effort services can be used
for less critical communication requirements to fully utilize the
available resources. Using best-effort services, however, provides no
guarantees.

A compromise between using guarantees only and having an efficient
interconnect is to combine guaranteed and best-effort services.
Guaranteed traffic should not be affected by best-effort traffic, while
best-effort traffic may use all the resources not used by guaranteed
traffic. Guaranteed services would then be used for the critical traffic
requirements, and best-effort services for non-critical traffic
requirements.

Figure 1: Network services (a) hide the interconnect details and allow
reusable components to be built on top of them, (b) are driven by the
application requirements, (c) their efficiency relies on technology and
network organization, and (d) are built using a layered approach.

In this paper, we first list a set of network-independent
communication services that are essential in chip design. In the
following sections, we show the trade offs between efficiency and cost
that we make in our NoC. In Section 3, we present the trade offs and take
decisions on network-related issues. In Section 4, we zoom into the
internals of the key component of our NoC: a router which efficiently
combines guaranteed and best-effort services.

2. SERVICES 

The increasing complexity of integrated circuits, and the strong
time-to-market pressure require modular designs and IP reuse. Decoupling
computation from communication in chip design serves both these
requirements [9]. This decoupling is realized by defining communication
interfaces that provide well-defined services and hide the implementation
details of the interconnect.

We show in Section 1 that guaranteed services are essential to
simplify IP design and integration. Examples of such guaranteed
services are data integrity, which means that data is delivered
uncorrupted; lossless data delivery, which means that no data is
dropped in the interconnect; and in-order data delivery, which
specifies that the order in which data is delivered is the same as
the order in which it has been sent. Other guarantees offer
time-related bounds, such as throughput and latency.

Guarantees require resource reservation for worst-case scenarios,
which can be expensive. For example, guaranteeing throughput for a
stream of data implies reserving bandwidth for its peak throughput,
even when its average is much lower. As a consequence, when using
guarantees, resources are often underutilized.

Resources are better utilized when best-effort traffic is used.
Best-effort services do not reserve any resources, and hence provide
no guarantees. As a consequence, their performance is dictated by
boundary conditions, such as interconnect load. For example, a
connection may become temporarily lossy in a congested network,
if the network resolves congestion by dropping data.

Best-effort services use resources well because they are typically
designed for average-case scenarios as opposed to worst-case
scenarios. They are also easy and fast to use, as they require no
resource reservation. Their main disadvantage is their
unpredictability: one cannot rely on a given performance (i.e., they
do not offer guarantees). In the best case, if certain boundary
conditions are assumed, a statistical performance can be derived.

The requirements for guaranteed services and the efficiency
constraint (good resource utilization) are conflicting. But a first
step to a predictable and low-cost interconnect is combining the
guaranteed and best-effort services in the same interconnect.
Guaranteed services would be used for critical traffic requirements,
and best-effort services for non-critical traffic requirements. For
example, a video processing IP will typically require a lossless,
in-order video stream with guaranteed throughput, but possibly
allows corrupted samples. Another example is cache updates, which
require uncorrupted, lossless, low-latency data transfer, but
ordering and guaranteed throughput are less important. In Section 4
we show how combining guaranteed and best-effort services
efficiently uses common resources. In the remainder of this section
we analyze the minimum level of abstraction at which the
communication services must be offered to hide the network
internals.

Traditionally, network services have been implemented and offered
using a layered protocol stack, typically aligned to the ISO-OSI
reference model [14]. NoCs also take this approach [2, 3, 6, 15],
because it structures and decomposes the service implementation,
and the protocol stack concepts aid positioning of services.

To achieve the decoupling of computation from communication, the
communication services must be offered at least at the level of the
transport layer in the OSI reference model. It is the first layer
that offers end-to-end services, hiding the network details; see
Figure 1(d) [3].

The lowest three layers in the protocol stack, namely the physical,
data-link, and network layers, are network specific. Therefore,
these services should not be visible to the IPs if decoupling of
computation from communication is desired. However, these layers
are essential in implementing the services, because constructing
guarantees without guarantees at the layer below is either very
expensive, or even impossible. For example, implementing lossless
communication on top of a lossy service requires acknowledgment,
data retransmission, and filtering of duplicated data. This leads
to a significant increase in traffic and also a trade-off between
large buffer space requirements and long delays. Even worse,
providing guarantees for time-related services is impossible if
lower layers do not offer those guarantees. For example, throughput
cannot be guaranteed if communication at a lower layer is lossy. As
a consequence, guarantees can only be built on top of guarantees,
see Figure 1(b). Similarly, a layer's efficiency is based on
efficient implementations of the layers below it, see Figure 1(c).

The NoC services that we consider essential for chip design are:
data integrity, lossless data delivery, in-order delivery, throughput,
and latency. Data integrity is always guaranteed. All the other
services can be guaranteed or not, depending on request. In the
next section, we describe briefly how these services are provided
by our NoC, and in Section 4 we describe in detail how our router
architecture enables an efficient implementation of these services.

3. NETWORKS ON CHIP 

Currently, the prevalent on-chip interconnects are busses and
switches [10]. These are single-hop interconnects, meaning that
there is no storage in the interconnect itself. Scalable interconnects
require multiple hops with storage in every hop (router). This
introduces a number of new issues, which we discuss in this section.

General computer network research is a mature research field [16]
which has many issues in common with NoCs. However, two significant
differences between computer networks and on-chip networks make the
trade-offs in their design very different [4]. First, routers of a
NoC are more resource constrained than those in computer networks,
in particular in the control complexity and in the amount of memory.
Second, communication links of a NoC are relatively shorter than
those in computer networks, allowing tight synchronization (network
flow control) between routers.

These two characteristics have a direct impact on the NoC service
implementation. In a NoC it is possible to solve data integrity at
the data-link layer at a low cost. We, therefore, assume it solved
at the network layer and higher. Lossless transport of data is
guaranteed by our routers. However, to allow consumers slower than
producers, the network may be allowed to drop data at its edge.
Consequently, the designer may choose either (a) a lossless
connection (i.e., implementing end-to-end flow control), or (b) a
lossy connection (i.e., without flow control). In-order delivery is
again guaranteed by our router (i.e., routers do not reorder data
between a given input port and a given output port). End-to-end
ordering of data, however, has to be provided on top of this at the
network edge when data is transported on different routes with
different delays. Guaranteed and best-effort throughput and latency
services are also implemented by the routers. These router services,
together with the programming model explained in Section 4.3.2,
offer network throughput and latency services.

[...] network architecture. These are: the switching mode, routing,
contention resolution, and network flow control. Equally important,
end-to-end flow control and congestion control are handled in our
NoC at the network edge instead of in the routers; we therefore omit
their discussion here.

3.1 Switching Mode 

The switching mode of a network specifies how data and control
are related. We distinguish circuit switching and packet switching.

In circuit switching, data and control are separated. First, the
control is provided to the network (connection set up). This results
in a circuit over which all subsequent data of the connection is
transported. In time-division switching, bandwidth is shared by
time-division multiplexing connections over circuits.
Circuit-switched networks inherently offer time-related guaranteed
services when resources are reserved during the connection set up.

In packet switching, data is divided into packets and every packet
is composed of a control part (the header) and a data part (the
payload). Network routers inspect, and possibly modify, the headers
of incoming packets to switch the packet to the appropriate output
port. Since in packet switching the packets are self-contained,
there is no need for a set-up phase to allocate resources.
Best-effort services are therefore naturally provided.

3.2 Routing 

Routing is the determination of the route (or path) that the data
follows from source to destination. There are two basic approaches:
source routing and destination routing. In source routing, the
network interface at the source computes the complete route to the
destination. In destination routing, only the network address of
the destination is specified, and every router selects the
appropriate output based on the address. We refer to [17] for
several classes of routing functions.

In circuit switching, routing is done once for all data in that
connection. In packet switching, routing is done for every
individual packet sent over the network. In both cases, source and
destination routing are possible. We currently consider source
routing because it is independent of the network topology, which is
not yet determined.

3.3 Contention Resolution

When a router attempts to send multiple data items over the same
link at the same time, contention is said to occur. As only one data
item can occupy a link at any point in time, a selection among the
contending data must be made; this process is called contention
resolution. Three approaches exist: avoiding contention, dropping
data (one of the contending data items is transmitted and the
remainder are deleted), and scheduling (or sequentializing) data
(all data items are sent in turn; some data items are therefore
delayed).

In circuit switching, contention resolution takes place at set up
at the granularity of connections, so that data sent over different
connections do not conflict. Thus, there is no contention during
data transport, and time-related guarantees can be given.

In packet switching, contention resolution takes place at the
granularity of individual packets. Dropping packets is possible, but
for a lossless service (a) it adds complexity to the network
(acknowledgments, retransmission, etc.), and (b) it ultimately
increases the [...]. Scheduling data is the only remaining option.

3.4 Network Flow Control

Network flow control, also called routing mode, deals with the
limited amount of buffering in routers and data acceptance between
routers. In circuit switching, connections are set up. The data sent
over these connections is always accepted by the routers and hence
no network flow control is needed. In packet switching, data must
be buffered at every router before it is sent on. Because routers
have a limited amount of buffering, they accept data only when they
have enough space to store the incoming data.

There are three types of flow control, namely store-and-forward,
virtual cut-through, and wormhole routing. In store-and-forward
routing, an incoming packet is received and stored in its entirety
before it is forwarded to the next router. This requires storage for
the complete packet, and implies a per-router latency of at least
the time required for the router to receive the packet.

In virtual cut-through routing, a packet is forwarded as soon as
the next router guarantees that the complete packet will be
accepted. Only when no guarantee is given is the whole packet stored
in the router. Thus, virtual cut-through routing requires buffer
space for a complete packet, like store-and-forward routing, but
allows lower-latency communication.

In wormhole routing, packets are split into so-called flits (flow
control digits). A flit is passed to the next router when that router
accepts that flit, even when there is not enough buffer space for the
complete packet. As soon as a flit of a packet is sent over an output
port, that output port is reserved for flits of that packet only. When
the first flit of a packet is blocked, the trailing flits can therefore
be spread over multiple routers, blocking the intermediate links.
Wormhole routing requires the least buffering (it buffers flits
instead of packets) and also allows low-latency communication.
However, it is more sensitive to deadlock and generally results in
lower link utilization than virtual cut-through routing.
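
The latency difference between these flow-control schemes can be
made concrete with a small calculation. The following Python sketch
is an idealized model only, not part of the router design: it assumes
that one flit crosses one link per cycle, that there is no contention,
and that routers add no processing delay. The function names are
illustrative.

```python
def store_and_forward_cycles(hops: int, packet_flits: int) -> int:
    """Cycles until the last flit arrives when every router on the path
    stores the complete packet before forwarding it."""
    return hops * packet_flits


def wormhole_cycles(hops: int, packet_flits: int) -> int:
    """Cycles until the last flit arrives when each flit is forwarded as
    soon as the next router accepts it (uncontended case)."""
    # The first flit needs one cycle per hop; the others follow back to back.
    return hops + packet_flits - 1


if __name__ == "__main__":
    for hops in (1, 4):
        print(hops,
              store_and_forward_cycles(hops, 8),
              wormhole_cycles(hops, 8))
```

Under these assumptions the two schemes coincide for a single hop and
diverge as the path grows, which is why per-router storage of complete
packets is costly in latency as well as in buffer space.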



[...] is one of our targeted services, and because it has the lowest
cost in terms of buffering, which is expensive on-chip.

4. A COMBINED GT-BE ROUTER 

Section 2 defines our requirements for NoCs in terms of services
that are to be offered, in particular guaranteed and best-effort
services. The previous section introduces a number of general
networking issues that are built upon here. In the following two
subsections we show that the guaranteed and best-effort services
can conceptually be described by two independent router
architectures. The combination of these two router architectures is
efficient and has a flexible programming model, as described in
Subsection 4.3.

4.1 A GT Router Architecture

Our guaranteed-throughput (GT) router must guarantee uncorrupted,
lossless, and ordered data transfer, and both throughput and latency
over a finite time interval. As mentioned earlier, data integrity is
solved at the data-link layer; we do not address it further. No data
is dropped by the GT router because we use a variant of circuit
switching (described in the next section). Data is transported in
fixed-size blocks, further explained below. As only one block is
stored per input in the GT router, blocks remain ordered. We now
turn to the more challenging time-related guarantees: throughput
and latency.

4.1.1 Time-related Guarantees 

Latency is defined as the time a packet spends in the network.
Guaranteeing latency, therefore, means that a worst-case upper
bound must be given for this time. Here we define throughput
for a given producer-consumer pair as the amount of data
transported by the network over a finite, fixed time interval.
Guaranteeing throughput means giving a lower bound.

We observe that guaranteeing latency in a lossless router is
difficult because contention requires scheduling and hence delays.
Guaranteeing throughput is less problematic. Rate-based packet
switching (for an overview see [18]) offers guaranteed throughput
over a finite period, and hence a latency bound. This bound is very
high, however, and the cost of buffering is also high.
Deadline-based packet switching [13] offers preferential treatment
for packets close to their deadline. This allows differential
latency guarantees (under certain admissible traffic assumptions),
but also at high buffer costs.

Circuit switching solves the contention at set up, so naturally
providing guaranteed latency and throughput. Circuits can be
pipelined to improve throughput [5], at the cost of additional
buffering and latency. Time-division multiplexing connections over
pipelined circuits additionally offers flexibility in bandwidth
allocation. This requires a notion of router synchronicity, which is
possible because a NoC is better controllable than a general network.
We explain this variation in more detail in the next subsection. The
associated programming model is described in Section 4.3.2.

4.1.2 Contention-free Routing

A router uses a slot table to (a) avoid contention on a link, (b)
divide up bandwidth per link, and (c) switch data to the correct
output. Every slot table T has S fixed-size time slots (rows), and N
router outputs (columns). There is a logical notion of synchronicity:
all routers in the network are in the same slot. In a slot s at most
one block of data can be read/written per input/output port. In the
next slot, (s + 1) mod S, the read blocks are written to their
appropriate output ports. Blocks thus propagate in a
store-and-forward fashion. The latency a block incurs per router is
equal to the duration of a slot. Bandwidth is guaranteed in multiples
of block size per S slots.

The entries of the slot table map outputs to inputs for every slot:
T(s, o) = i. An entry is empty when there is no reservation for that
output in that slot. No contention arises because there is at most
one input per output. Sending a single input to multiple outputs
(multicast) is possible.

The slots reserved for a block along its path from source to
destination increase by one (modulo S). If slot s is reserved in a
router, slot (s + 1) mod S must be reserved in the next router on
the path.

The assignment of slots to connections in the network is an
optimization problem, and is described in Section 4.3.3. Section
4.3.2 explains how slots are reserved in the network, by means of
best-effort packets.
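
As a rough illustration of the slot table described in this
subsection, the following Python sketch models one table with S rows
and N columns, the lookup that switches blocks to outputs, and the
rule that slots increase by one (modulo S) along a path. It is a
software model under simplifying assumptions, not the hardware
design; the class and function names (SlotTable, switch,
slots_along_path) are hypothetical.

```python
from typing import Dict, List, Optional


class SlotTable:
    """Slot table T of one router: S slots (rows) by N outputs (columns)."""

    def __init__(self, num_slots: int, num_outputs: int) -> None:
        self.num_slots = num_slots
        # t[s][o] is the input reserved for output o in slot s, or None.
        self.t: List[List[Optional[int]]] = [
            [None] * num_outputs for _ in range(num_slots)
        ]

    def reserve(self, slot: int, output: int, inp: int) -> None:
        """Set T(slot, output) = inp; at most one input per output."""
        if self.t[slot][output] is not None:
            raise ValueError("output already reserved in this slot")
        self.t[slot][output] = inp

    def switch(self, slot: int, blocks: Dict[int, str]) -> List[Optional[str]]:
        """Return, per output, the block taken from its reserved input."""
        return [None if i is None else blocks.get(i) for i in self.t[slot]]

    def reserved_slots(self, output: int, inp: int) -> int:
        """Number of slots (out of S) in which inp owns output."""
        return sum(1 for row in self.t if row[output] == inp)


def slots_along_path(first_slot: int, hops: int, num_slots: int) -> List[int]:
    """Slot to reserve in each successive router: s, s+1, ... (modulo S)."""
    return [(first_slot + k) % num_slots for k in range(hops)]
```

Reserving the same input for two outputs in one slot gives multicast,
and the guaranteed bandwidth of a connection on a link is
reserved_slots blocks per S slots.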

4.2 A BE Router Architecture

Best-effort (BE) traffic can have a better average performance
than offered by guaranteed services. This depends on boundary
conditions, such as network load, that are unpredictable.
Best-effort services thus fulfill our efficiency requirement, but
without offering time-related guarantees. This section describes an
architecture for a best-effort service with uncorrupted, lossless,
in-order data transport.

The router efficiency is influenced by both its complexity and
its utilization. In Section 3 we have justified our choice for
routing (source routing) and network flow control (wormhole). Now
we determine the contention resolution scheme that is used. It has
two components: buffering and scheduling. Our router prototypes
show that the buffering costs dominate the cost of the router. The
main trade-off in Section 4.2.1 is therefore between buffer costs and
link utilization, which are both critical resources. For the chosen
buffering strategy an efficient scheduling algorithm is presented in
Section 4.2.2, trading off link utilization and scheduler complexity.

4.2.1 Buffering Strategy

The buffering strategy determines the location of buffers inside
the router. We distinguish input queuing, output queuing, and
virtual output queuing. In the following, N is the number of inputs
(equal to the number of outputs) of the router. We believe that in
a balanced solution the rates at which routers and links operate are
equal. Slower routers require more buffering, and faster routers are
not feasible as links operate at high speed.

In input queuing there is a single queue per input, resulting in
the lowest buffer cost (N logical queues in N physical memories)
of all three approaches. However, due to the so-called head-of-line
blocking, for large N network utilization saturates at 59% [8].
Therefore, input queuing results in weak utilization of the links.

Output queuing can increase the link utilization to 100% by having
N queues at each output, or N^2 queues, with as many physical
memories. It is better to have fewer, larger memories than more,
smaller memories because the overhead of small RAMs is very
high. Overclocking the router by a factor N to use N memories
is not possible, as argued previously. So the number of memories
depends quadratically on N, hence output queuing is not scalable.

Virtual output queuing [1] (VOQ) combines the advantages of
input queuing and output queuing. It has the buffering complexity
of input queuing and the link utilization of output queuing. As for
output queuing, there are N^2 logical queues, but they are combined
in N physical memories at the inputs, as for input queuing. For
every input i there are N queues Q(i, o), one for each output o, see
Figure 2. There is at most one write to these queues. The difference
between output queuing and VOQ is the additional constraint that
there can be at most one read from this group of N queues. (This
enables the mapping of all queues of the same input to one memory.)
This additional constraint has to be taken into account by the
scheduling. 100% link utilization can still be achieved when N is
large [12].

We select VOQ because it combines high link utilization with 
moderate buffer costs. 
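
The effect of head-of-line blocking, and how VOQ avoids it, can be
seen in a small software model. The sketch below is illustrative
only: the names fifo_requests and VoqInput are hypothetical, and each
queued item is simply tagged with the output it wants. It compares
which outputs one input can request under input queuing (a single
FIFO) and under VOQ (one logical queue per output).

```python
from collections import deque
from typing import Deque, List, Tuple


def fifo_requests(fifo: Deque[Tuple[int, str]], n: int) -> List[bool]:
    """Input queuing: only the output wanted by the head item can be
    requested; items behind it are blocked."""
    req = [False] * n
    if fifo:
        req[fifo[0][0]] = True
    return req


class VoqInput:
    """One input port with N logical queues Q(i, o), one per output o."""

    def __init__(self, n: int) -> None:
        self.queues: List[Deque[str]] = [deque() for _ in range(n)]

    def enqueue(self, output: int, item: str) -> None:
        self.queues[output].append(item)

    def requests(self) -> List[bool]:
        """Every non-empty queue may request its output."""
        return [len(q) > 0 for q in self.queues]

    def dequeue(self, output: int) -> str:
        """At most one such read per input per scheduling round."""
        return self.queues[output].popleft()
```

With arrivals destined for outputs 0, 1, 0, the FIFO can only request
output 0, whereas the VOQ input can request outputs 0 and 1, so the
item for output 1 is not held up when output 0 is busy.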

4.2.2 Matrix Scheduling

This section shows how link contention and memory contention
(imposed by VOQ) are resolved. Matrix scheduling solves both
kinds of contention by ensuring that every VOQ memory is read at
most once, and every output (link) is written to at most once. The
scheduling problem can be modeled as a bipartite graph matching
problem as follows. Every input port i is modeled by a node u_i and
every output port o by a node v_o. There is an edge between u_i and
v_o if and only if queue Q(i, o) is non-empty. A match is a subset
of these edges such that every node is incident to at most one edge.
For example, Figure 3(c) is a match of Figure 3(a). The number of
edges in the match is its size; a match is maximal when no edges
can be added to it. A maximum size match is a largest size match.

Although maximum size matches are optimal, there are two reasons
not to restrict ourselves to them. First, maximum size matching
algorithms have O(N^(5/2)) complexity. Since matrix scheduling is
done at flit rate, this is not feasible for large N. Second, maximum
size matching algorithms can be unfair, which can result in
starvation, i.e., some queues are never served.

Figure 2: Schematic of a router using virtual output queuing.

Figure 3: The three steps of a single iSLIP iteration.

There are several matching algorithms; see [11] for a thorough
discussion. We select the iterative SLIP (iSLIP) matrix scheduling
algorithm because it has a low complexity, avoids starvation, and
provides increasing performance as the number of iterations grows.
It reaches a maximal match [...]. Even a single iteration
considerably outperforms input queuing, and can be efficiently
implemented in hardware. Multiple iterations increase the latency
of the control path, and hence the flit size (as explained in
Section 4.3.1). We consider using 1-SLIP because multiple
iterations give only marginal improvement.

A single iSLIP iteration has three steps, illustrated by an example
in Figure 3 for N = 4. In the first step, see Figure 3(a), every
non-empty queue Q(i, o) requests access to output port o from input
port i. In the second step, see Figure 3(b), every output port o
grants one request, solving link contention at the output ports. In
the third step, see Figure 3(c), every input port i accepts one
grant, to resolve memory contention at the input port. We extend
iSLIP to take network flow control into account.
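
The three steps can be sketched in software as follows. This is a
minimal model of one request/grant/accept iteration, not the hardware
scheduler: the round-robin pointers and their update rule follow the
usual description of iSLIP and are an assumption here, since the text
above only names the three steps. The extension for network flow
control is not modeled, and the function name islip_iteration is
hypothetical.

```python
from typing import Dict, List


def islip_iteration(nonempty: List[List[bool]],
                    grant_ptr: List[int],
                    accept_ptr: List[int]) -> Dict[int, int]:
    """One request/grant/accept iteration over an N x N VOQ occupancy matrix.

    nonempty[i][o] is True when queue Q(i, o) holds at least one flit.
    grant_ptr[o] and accept_ptr[i] are round-robin pointers, updated in
    place. Returns a dict mapping each matched input to its output.
    """
    n = len(nonempty)
    # Step 1 (request): every non-empty Q(i, o) requests output o.
    # Step 2 (grant): each output grants the first requesting input at or
    # after its round-robin pointer.
    grants: Dict[int, List[int]] = {}  # input -> outputs that granted it
    for o in range(n):
        for k in range(n):
            i = (grant_ptr[o] + k) % n
            if nonempty[i][o]:
                grants.setdefault(i, []).append(o)
                break
    # Step 3 (accept): each input accepts the first granting output at or
    # after its pointer; pointers advance only for accepted grants.
    match: Dict[int, int] = {}
    for i, outs in grants.items():
        o = min(outs, key=lambda out: (out - accept_ptr[i]) % n)
        match[i] = o
        grant_ptr[o] = (i + 1) % n
        accept_ptr[i] = (o + 1) % n
    return match
```

A single iteration does not necessarily produce a maximal match: an
output whose grant is not accepted stays idle for that round, and the
pointer updates make a different input win in the next round.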

4.3 Combining the GT and BE Routers

The GT and BE router architectures are combined to share
resources, in particular the links, memories, and switches. Moreover,
best-effort traffic enables a packet-based programming model for
the guaranteed traffic, as shown later in Section 4.3.2.

The principal constraint for a combined router architecture is that
guaranteed services are never affected by best-effort services.
Figure 4(a) shows that, conceptually, the combined router contains
both router architectures (fat lines represent data transport, thin
lines represent control). [...] either the GT or the BE router. The
GT traffic (the traffic that is served by the GT router) has the
higher priority, to maintain guarantees. This is ensured by the
arbitration unit, which therefore affects the best-effort
scheduling. Furthermore, best-effort packets can program the
guaranteed router, as shown by the arrow labeled program. Thin lines
going from the right to the left indicate network flow control,
which is only required for best-effort packets because guaranteed
blocks never encounter contention.

On a shared link only one BE or GT data item can arrive or be
sent at any point in time. Thus GT and BE memories can be shared,
keeping the number of memories at N, with N + N^2 logical queues
in total. Figure 4(b) shows that the data path, consisting of
memories and switch matrix, is shared, whereas the control paths of
the BE and GT routers are separate, yet interrelated. Moreover, the
arbitration unit of Figure 4(a) has been absorbed by the BE router.
The following subsection shows how this can be done.

4.3.1 Arbitration and Flit Size

When combining GT and BE traffic in a single network, the impact on
the network flow control scheme must be taken into account. Recall
from Section 3.4 that a BE flit is the smallest unit at which flow
control is performed. In other words, the BE scheduling, using iSLIP,
can only react to GT blocks at flit granularity. To avoid alignment
problems, the block size (B words) is a multiple of the flit size
(F words, with B = lF). The factor l is constant; we prefer a small
l and F to decrease the store-and-forward delay for guaranteed
traffic.

Figure 4: Two views of the combined GT-BE router: (a) conceptual
view, (b) hardware view.

We extend iSLIP to handle the combination of GT and BE traffic.
In this combination GT traffic always has priority over BE traffic.
This is to ensure that guarantees are never corrupted.
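
One simple way to picture this priority rule is to remove, before
best-effort scheduling, every request that would collide with a GT
reservation in the current slot. The Python sketch below does this
for one row of the slot table. It is an assumption-laden
illustration, not the router's actual arbitration logic: the function
name mask_be_requests is hypothetical, and blocking an input whose
shared memory is being read for GT traffic is an assumption derived
from the one-read-per-input constraint of VOQ.

```python
from typing import List, Optional


def mask_be_requests(nonempty: List[List[bool]],
                     gt_row: List[Optional[int]]) -> List[List[bool]]:
    """Remove BE requests that conflict with GT traffic in this slot.

    nonempty[i][o] is True when BE queue Q(i, o) is non-empty.
    gt_row[o] is the input reserved for output o in the current slot,
    or None when the output carries no GT block.
    """
    n = len(nonempty)
    # Inputs currently read for GT (assumed: one read per shared memory).
    busy_inputs = {i for i in gt_row if i is not None}
    return [
        [
            nonempty[i][o] and gt_row[o] is None and i not in busy_inputs
            for o in range(n)
        ]
        for i in range(n)
    ]
```

The masked matrix can then be handed to the best-effort scheduler, so
that BE flits only use links and memory reads left over by GT blocks.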

4.3.2 Programming Model 

In this section we show how GT connections are set up and torn
down by means of BE packets. To ensure scalability, programming
must not require global or centralized resources. Section 4.1.2
explains why our contention-free routing uses slot tables; we now see
that they are distributed over routers for scalability.

Initially the slot table of every router is empty. This means that
GT connections can only be set up using BE packets, unless an
additional communication infrastructure is introduced solely for
programming. Two special packets, Reset and Start, are used to reset
and start the NoC, respectively. They progress by flooding, and are
not subject to the usual network flow control. We will not discuss
them further. There are three system packets: SetUp, TearDown,
and AckSetUp. They are used to program the slot table in every
router on their path.

The SetUp packet is used to create a connection from a source
to a destination, and travels in the direction of the data
("downstream"). AckSetUp acknowledges a successful set up, and flows
upstream. The TearDown packet destroys (partially) existing
connections, and can travel in either direction. SetUp packets
contain the source of the data, the path to their destination, and a
slot number. Every router along the path of the SetUp packet checks
if the output to the next router on the path is free in the slot
indicated by the packet. If it is free, the output is reserved in
that slot, and the SetUp packet is forwarded with an incremented
(modulo S) slot. Otherwise, the SetUp packet is discarded and a
TearDown packet returns along the same path. Thus every path must be
reversible; this is the only assumption we make about the network
topology. These upstream TearDown packets free the slot, and
continue with a decremented slot. Downstream TearDown packets work
similarly, and remove existing connections. A connection is
successfully created when an AckSetUp is received, else a TearDown
is received.
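
The behaviour of a SetUp packet can be sketched as a walk over the
slot tables on its path. The Python model below is a simplification
under stated assumptions: packets are not modeled as traffic, each
table is a plain list indexed as table[slot][output], and the
function name set_up and the connection label are hypothetical. The
returned string stands for the packet that the source eventually
receives.

```python
from typing import List, Optional, Tuple

Table = List[List[Optional[str]]]  # table[slot][output] -> connection or None


def make_table(num_slots: int, num_outputs: int) -> Table:
    return [[None] * num_outputs for _ in range(num_slots)]


def set_up(tables: List[Table], path: List[Tuple[int, int]],
           first_slot: int, conn: str) -> str:
    """Model a SetUp packet travelling downstream along path.

    path lists (router, output) pairs from source to destination.
    Returns "AckSetUp" on success, "TearDown" if some slot is taken.
    """
    num_slots = len(tables[path[0][0]])
    slot = first_slot
    reserved: List[Tuple[int, int, int]] = []  # (router, slot, output)
    for router, output in path:
        if tables[router][slot][output] is not None:
            # Conflict: the upstream TearDown frees the slots reserved so
            # far, visiting routers in reverse order.
            for r, s, o in reversed(reserved):
                tables[r][s][o] = None
            return "TearDown"
        tables[router][slot][output] = conn
        reserved.append((router, slot, output))
        slot = (slot + 1) % num_slots  # forwarded with an incremented slot
    return "AckSetUp"
```

A failed set up leaves every slot table as it was before the SetUp
packet was sent, which is what allows the source to simply retry with
another slot number.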

The programming model is pipelined and concurrent (multiple
system packets can be active in the network simultaneously, also
from the same source) and distributed (active in multiple routers).
Given the distributed nature of the programming model, ensuring
consistency and determinism is crucial. The outcome of programming
may depend on the execution order of system packets, but is
always consistent. The next section shows how to use the
programming model.

4.3.3 Slot Allocation

This section explains ways to determine the slots specified in
SetUp packets. A slot allocation for a single connection requires
that, at every router along the path, the required output is free in the
appropriate slot. Therefore, interference of SetUp packets of
multiple connections can be completely avoided if connections are set
up with conflict-free slots or paths. All execution orders of SetUp
packets then give the same result.

Computing an optimal slot allocation is complex and requires a
global network view. It can be used only for small problem
instances. To reduce computational cost, heuristics can be used, but
this probably leads to non-optimal solutions. Compile-time slot
allocations from both approaches can be recreated deterministically
at run time, concurrently and distributedly (because all SetUp
packets are conflict-free).

At run time, a global view requires a centralized slot allocation.
This impairs scalability and slows down programming. Run-time
distributed slot allocation is scalable, but lacks a global view. This
typically results in suboptimal slot allocation. Moreover, SetUp
packets may interfere, making programming more involved, and
perhaps non-deterministic. However, dynamic connection management
at high rates will require distributed slot allocation. In
a simple distributed greedy algorithm, all sources repeatedly
generate random slot numbers for each set up until their connection
succeeds.
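
The greedy scheme in the last sentence can be sketched as a retry
loop. The Python model below is a sketch under simplifying
assumptions: it checks a whole path at once instead of sending SetUp
packets hop by hop, it bounds the number of attempts (max_tries is an
added parameter, not something the text specifies), and the function
name greedy_allocate is hypothetical.

```python
import random
from typing import List, Optional, Tuple

Table = List[List[Optional[str]]]  # table[slot][output] -> connection or None


def greedy_allocate(tables: List[Table], path: List[Tuple[int, int]],
                    conn: str, rng: random.Random,
                    max_tries: int = 100) -> Optional[int]:
    """Draw random first slots until the whole path is free.

    path lists (router, output) pairs from source to destination.
    Returns the first slot of the connection, or None after max_tries.
    """
    num_slots = len(tables[path[0][0]])
    for _ in range(max_tries):
        first = rng.randrange(num_slots)
        # Slots increase by one (modulo S) from router to router.
        hops = [((first + k) % num_slots, r, o)
                for k, (r, o) in enumerate(path)]
        if all(tables[r][s][o] is None for s, r, o in hops):
            for s, r, o in hops:
                tables[r][s][o] = conn
            return first
    return None
```

Because each source only needs its own random numbers and the replies
to its own set ups, no global view is required; the price is that the
resulting allocation depends on the order in which attempts happen to
succeed.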

We conclude that our programming model allows both compile-time
and run-time slot allocation. Computational complexity,
deterministic results, and scalability can be balanced according to
system requirements.

5. CONCLUSIONS 



[...] of transistors requires decoupling computation from
communication. For communication, networks on chip (NoCs) are
emerging as an alternative for existing interconnects to solve
technological, performance, and scalability problems.

In this paper we show that guaranteed services are essential to
provide predictable interconnects that enable compositional system
design and integration. However, guarantees typically utilize
resources inefficiently. Best-effort services overcome this problem
but provide no guarantees. So, combining guaranteed and best-effort
services allows efficient resource utilization, while still
providing guarantees for critical traffic.

Time-related guarantees, such as throughput and latency, can only
be constructed on a NoC that intrinsically has these properties. We
therefore define a router-based NoC architecture that combines
guaranteed and best-effort services. Thus, the router architecture
has conceptually two parts: the guaranteed-throughput (GT) and
best-effort (BE) routers. Both offer data integrity, lossless data
delivery, and in-order data delivery. Additionally, the GT router
offers guaranteed throughput and latency services using pipelined
circuit switching with time-division multiplexing. This requires a
notion of synchronicity: at each time slot at most one block of
data is communicated over a link. The GT router has low latency and
moderate memory requirements. The BE router uses packet switching,
wormhole routing, and virtual output queuing with iSLIP. The BE
router has low latency, high link utilization, and moderate memory
requirements.

We combine the GT and BE router architectures efficiently by
sharing router resources. The guarantees are never affected by the
BE traffic, and links are efficiently utilized because BE traffic
uses all bandwidth left over by GT traffic. Connections are
programmed using BE packets. The programming model is pipelined,
concurrent, and distributed. It enables run-time and compile-time
[...] deterministic and adaptive connection management.

For all our architecture choices, we show the trade-offs between
hardware complexity and efficiency, and motivate our choices.

In conclusion, we describe and motivate a combined guaranteed
and best-effort router, which is an essential component in a NoC.
It fulfills our requirements by providing guaranteed services, and
satisfies the efficiency constraint by good resource utilization.

1. Integrated circuit comprising a plurality of modules, and a network arranged
for transferring messages between the modules, wherein a message issued by a 
module comprises first information indicative for a location of an addressed module 
within the network, and second information indicative for a location within the 
addressed module, 

characterized in that the first and the second information are arranged as a single 
address from which the network determines which module is addressed, and from 
which the addressed module determines which of its locations is selected. 

2. Method for exchanging messages in an integrated circuit comprising a 
plurality of modules, the messages between the modules being exchanged via a 
network, wherein a message issued by a module comprises first information indicative 
for a location of an addressed module within the network, and second information 
indicative for a location within the addressed module, 

characterized in that the first and the second information are arranged as a single 
address from which the network determines which module is addressed, and from 
which the addressed module determines which of its locations is selected.

3. Integrated circuit comprising a plurality of processing modules and a network
arranged for providing at least one communication channel between a first and a
second module, which communication channel supports transactions comprising outgoing
messages from the first module to the second module and return messages from the
second module to the first module, characterized in that the network manages the
outgoing messages in a way different from the return messages.

4. Method for exchanging messages in an integrated circuit comprising a
plurality of modules, the messages between the modules being exchanged via a
network, wherein a communication channel through the network supports transactions
comprising outgoing messages from the first module to the second module and return
messages from the second module to the first module, characterized in that the
network manages the outgoing messages in a way different from the return messages.

5. Integrated circuit according to claim 3, wherein the network has a first mode
wherein a message is transferred within a guaranteed time interval, and a second
mode wherein a message is transferred as fast as possible with the available resources,
wherein the outgoing transaction is a read message, requesting the second module to
send data to the first module, wherein the return transaction is the data generated by
the second module upon this request, and wherein the outgoing transaction is
transferred according to the second mode, and the return transaction is transferred
according to the first mode.

6. Integrated circuit according to claim 3, wherein the network allows at least
two of the following transaction modes: unordered, locally ordered and globally
ordered, wherein an unordered transaction mode of the network gives no guarantees 
for the order in which messages will arrive at their destination, a locally ordered 
transaction mode guarantees that messages sent to the same destination will arrive in 
the same order as they were sent, a globally ordered transaction mode guarantees that
messages will arrive in the same order as they were sent even if they are sent to 
different destinations, wherein outgoing and return transactions are handled according 
to different transaction modes. 

7. Integrated circuit according to claim 3, wherein the network reserves a first 
and a second buffer space for the first and the second module respectively, the 
buffer spaces having a mutually different size.

8. Integrated circuit comprising a plurality of modules, which modules are 
arranged to communicate to each other via a network, wherein the network is 
arranged to distribute a message from a first module to two or more second modules, 
and wherein the second modules are arranged to generate an acknowledge message 
indicating receipt of the message from the first module, 

the network being arranged to generate a single return message to the first module, in 
dependence of the acknowledge messages of the second modules. 

9. Integrated circuit according to claim 8, wherein the single return message 
indicates that at least one of the second modules has received the message issued by 
the first module.

10. Integrated circuit according to claim 8, wherein the single return message 
indicates that each of the second modules has received the message issued by the first 
module.

11. Integrated circuit comprising a first plurality of processing modules and a
network, the network comprising a second plurality of nodes and interconnections
between nodes, the network being arranged for transferring messages between a first
and a second module via a path through the network, the processing modules coupled
to the network via a network interface having a buffer for receiving incoming
messages, wherein a message from a first to a second module is not initiated until the 
buffer has sufficient space for receiving a return message from the second module. 
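
The buffer condition in claim 11 can be illustrated with a short sketch: the
network interface of the requesting module sets aside space for the return
message before it lets the request leave. The class name, method names, and
the unit of buffer space (words) are hypothetical and are not taken from the
claims.

```python
class NetworkInterface:
    """Requester-side interface that reserves room for return messages."""

    def __init__(self, buffer_words):
        self.free = buffer_words  # unreserved space in the receive buffer
        self.reserved = {}        # request id -> words held for its return

    def try_send(self, request_id, return_words):
        """Initiate a request only if its return message will fit."""
        if return_words > self.free:
            return False          # hold the request back for now
        self.free -= return_words
        self.reserved[request_id] = return_words
        return True

    def consume_return(self, request_id):
        """The module has read the return message; release its space."""
        self.free += self.reserved.pop(request_id)


ni = NetworkInterface(buffer_words=4)
first = ni.try_send(request_id=1, return_words=3)   # fits, space reserved
second = ni.try_send(request_id=2, return_words=2)  # only 1 word left
ni.consume_return(1)                                # frees 3 words
retry = ni.try_send(request_id=2, return_words=2)   # now it fits
```

Under these assumptions a return message always finds room at the requester,
so it never has to wait inside the network for buffer space.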

Networks are emerging as a possible solution for on-chip
interconnects. In this paper, we describe how networks
on chip (NoC) are similar to and differ from
both off-chip networks (e.g., computer networks) and current
on-chip interconnects (e.g., buses). We re-examine
the communication services in the context of NoCs. We
provide services that abstract from network implementations,
enabling a clean separation between the NoC and
IP blocks. We define a request-response transaction model
similar to bus protocols, making our approach backward
compatible. To exploit the full power of NoCs,
we also provide connection-oriented communication with
differentiated services. Examples are bandwidth guarantees,
transaction ordering, and end-to-end flow control.

Key Words: Networks on chip, on-chip buses, computer
networks, communication services, protocol stack, 
transaction, connection. 